MEASUREMENT and EVALUATION 


in 

EDUCATION 







An introduction to its theory and practice 
at both the elementary and secondary school levels








JAMES M. BRADFIELD

Professor of Education
Sacramento State College


H. STEWART MOREDOCK

Associate Professor of Mathematics
Sacramento State College


The Macmillan Company • New York 



PRINTED IN THE UNITED STATES OF AMERICA 
First Printing 


LIBRARY OF CONGRESS CATALOG CARD NUMBER: 57-6354



PREFACE 


The concern of this book is what teachers, principals, supervisors, and 
other educational specialists need to know and do if they are to deal efficiently 
with measurement and evaluation. Among such items of essential knowledge
and performance are thought to be the following: 

1. The basic concepts of measurement and evaluation that underlie valid
practice. 

2. The technical terminology involved. 

3. Phenomena that may deserve measurement and their measurable di- 
mensions. 

4. The nature of measurement symbols; the many procedures of measurement
useful in the schools; and certain statistical ideas and operations important
for proper interpretation and use of test results.

5. Standards appropriate to evaluating pupil achievement and efficient
ways of reporting evaluations to pupils and parents.

6. How all these things apply to your specialization as to subject, grade,
function, etc.

Our treatment of these matters is based on a given rationale of measurement
and evaluation and is intentionally developmental and analytic in character.
Passages and chapters are interrelated and interdependent. Definitions
and principles developed in one chapter are applied in subsequent ones. The
first section deals with basic concepts, terminology, and the general features
of dimensions, symbols, procedures, statistics, standards, marking, and
reporting. In the second section these concepts are applied to school subjects,
to intelligence, and to character and personality variables.

Consequently, it is recommended that each of the chapters in Section I
be studied carefully and in order, in advance of any reading in Section II.
It is essential that the Overviews be read for both sections. They explain the
basic rationale of the sections and the nature and interrelationship of chapters.

Because the text is an introductory and general one, the treatment is
limited largely to things of instructional significance. Such matters of
administrative importance as teacher rating, curriculum evaluation, school plant
appraisal, and community surveys are omitted. Moreover, the construction
of standardized tests, the administration of individual intelligence tests,
projective techniques for personality measurement, and other very special
psychometric procedures are discussed only with a view to general understanding
and not to practice.

There are several exercises in each chapter designed to help you apply 
the principles and procedures discussed. In many cases you may wish to 
select from among them rather than do them all. 

The bibliography at the end of each chapter indicates the readings that 
have contributed to the ideas and techniques presented in the chapter, includ- 
ing specific references. The bibliography is not meant to be a list of suggested 
readings. Where additional reading seems advisable, appropriate titles are
indicated in the text. 

Footnotes are used for technical explanations, for suggested additional 
reading, for certain citations with no general relationship to educational 
measurement or evaluation, and for passages that may interest only a few 
readers. 

The appendix contains an extensive glossary of terms that have technical 
significance either in this text or in measurement and evaluation generally. 
While we have tried to define any unusual term the first time it is mentioned,
and very important terms repeatedly, you may wish to consult the glossary
from time to time. Also in the appendix are sample report cards and an
annotated bibliography of published tests in all areas. The bibliography should be
helpful in selecting tests for study and for use.

The appendix is concluded by two statistical tables. The table of normal
curve area-z score relationships will be helpful in interpreting the confidence
limits of various measurements. The other table compares graphically the 
several types of norm scores used in standardized tests. 



ACKNOWLEDGMENTS 


Textbooks are written only with the assistance and cooperation of a great
many people. Here we wish to mention a few of those to whom we are 
particularly indebted. 

Dr. H. Glenn Ludlow, Associate Professor of Education, University of 
Michigan, read the entire manuscript and provided numerous suggestions for 
its improvement. Dr. H. Orville Nordberg, Chairman of the Division of 
Teacher Education at Sacramento State College, read several chapters criti- 
cally. Many others of our colleagues at Sacramento State College assisted us 
with encouragement, advice, and materials. Among them are Dean Harold B.
Roberts, Dr. Palmer A. Graver, Dr. Edwin L. Klingelhofer, Mrs. Lucille
Colby, and Dr. Charles F. Howard.

Three stenographers typed the bulk of the manuscript, Miss Gerry Millard, 
Mrs. Louise Lowry, and Mrs. Carolyn Larsen. Mrs. Larsen, who worked on the 
final draft, was as valuable for her editing as for her typing. We are extremely
grateful to Mrs. Frances Jones, a young artist, who drew the many graphs and
charts in the book.

The specimens of student work to be found in Chapter 5 were provided
by the following teachers at Sacramento Senior High School: Mr. Frederick
W. Blodgett, Mr. Clarence I. Fiscus, Mrs. Helen Lafferty, Mr. Ray E. Loehr,
Mr. John P. Moore, Dr. Gerald G. Smith, and Mr. Gerry S. Watt. Sample
report cards were sent to us by numerous elementary and secondary schools
in the Sacramento region.

Many publishers have permitted us to reproduce items from their publications
and, as in any textbook, we have drawn on and referred to many
professional books and articles. These are all acknowledged as they occur.

Finally, we wish to thank the several hundred students who were in our
educational measurement classes during the last five years. They were the
“guinea pigs” for the development of the book. On them we tried out new
ideas and new approaches, and from them we hope we learned something of
what an introductory text in educational measurement should say and how
it should be said.

James M. Bradfield 
H. Stewart Moredock 

Sacramento State College 





CONTENTS


I. Fundamental Conceptions and Procedures: Overview

1 FORMS OF MEASUREMENT SYMBOLS
2 PREPARING PHENOMENA FOR MEASUREMENT
3 PROCEDURES OF MEASUREMENT IN GENERAL
4 OBSERVATION
5 PRODUCT ANALYSIS AND FREE-RESPONSE PROCEDURES
6 GUIDED RESPONSE PROCEDURES
7 STATISTICAL DESCRIPTIONS OF MEASUREMENT DATA
8 FURTHER STATISTICAL CONCEPTS IN MEASUREMENT
9 EVALUATIVE STANDARDS: MARKING AND REPORTING ACHIEVEMENT


II. Customary Uses of Measurement and Evaluation in Education: Overview

10 LANGUAGE ARTS
11 SOCIAL STUDIES
12 SCIENCE AND MATHEMATICS
13 PERFORMANCE ACTIVITY AREAS
14 INTELLIGENCE
15 PERSONALITY AND CHARACTER
16 SCHOOL-WIDE TESTING PROGRAMS


Appendix

A GLOSSARY OF TERMS USED IN MEASUREMENT AND EVALUATION
B BIBLIOGRAPHY OF SELECTED PUBLISHED TESTS
C SAMPLES OF REPORT CARDS AND RECORD FORMS
D NORMAL PROBABILITY AND DERIVED SCORE TABLES

Index




TABLES


1. Some Pupil Products Particularly Amenable to Factor Count Scoring
2. Types of Guided Response Items Compared as to Chance Factor, Time for Administration, and Difficulty
3. An Outline for Constructing a Frequency Table
4. An Outline for Constructing a Histogram
5. An Outline for Computing the Median of a Distribution
6. An Outline for Computing the Mean from the Formula [formula illegible]
7. An Outline for the Computation of the Standard Deviation
8. An Outline for Developing a Cumulative Percentage Curve
9. Published Tests Listed for Language Arts Areas in the Fourth Mental Measurements Yearbook
10. An Illustrative Plan of Measurement and Evaluation for Eighth-Grade Social Studies
11. Test of Attitudes Toward Government, Minority Groups, and School Life
12. A Free-Response Test Scored According to an Analysis Schedule
13. Observation Schedule for Study Skills and Attitudes
14. Pupil Study Log
15. Example of an Evaluative Standard for Science
16. Examples of Procedures for Measuring Ability to Think Scientifically
17. Example of an Evaluative Standard for Arithmetic Computation
18. Example of an Evaluative Standard for Understanding an Arithmetic Concept
19. Example of an Evaluative Standard for Logic
20. Examples of Test Items for Arithmetic
21. Examples of Test Items for Algebra
22. Examples of Test Items for Geometry
23. Rating Chart for Art Performance
24. Check List for Driving Performance
25. Musical Performance Test
26. Rating Form for Applying Varnish
27. Plan for Judging a Typing Performance
28. Outline for Rating a Gathered Skirt
29. Plan for Evaluating a Diving Performance
30. A Picture Articulation Test for Nonreaders
31. Outline of Personality and Character Dimensions, Measuring Procedures, and Evaluative Standards
32. Percentage of Area Under the Normal Curve Between Mean Ordinate and Ordinate at Given z Score
33. Several Types of Score as Related to a Normal Probability Distribution



FIGURES


1. Principles of Sampling
2. Probable Relationship Between the Reliability or Validity of a Procedure and the Time and Energy Devoted to It
3. Check List for Observation in Physical Education
4. Anecdotal Record for Physical Education
5. Types of Rating Scale Items
6. Examples of Pupil Products Scored by Various Methods
7. Examples of Free-Response Items Scored by Several Methods
8. Sample Pictures from the Murray Thematic Apperception Test and the Rorschach Ink Blot Test
9. Examples of Guided Response Test Items Which Require Selection of an Answer
10. Examples of Guided Response Items Which Require Provision of an Answer
11. Examples of Guided Response Items in Which Arrangement of Elements Is Involved
12. Measurement of Many-Degreed Dimensions by a Series of Dichotomous Classifications
13. Section of an Item Analysis Table
14. Illustration of a Test Item Entered on a Card Together with an Analysis of Responses to the Item
15. Example of a Strip Key and a Stencil Key
16. A Test Profile Form
17. Frequency Polygon
18. Smoothed Frequency Curve
19. Bell-Shaped or Normal Distribution
20. Symmetrical Distribution
21. Skewed Distribution
22. J-Shaped or Poisson Distribution
23. Rectangular Distribution
24. U-Shaped Distribution
25. Bimodal Distribution
26. Median Defined in Terms of the Area of a Histogram
27. Finding a Median Illustrated
28. Locating the Median in the Middle Interval
29. Mean as a Point of Balance
30. Illustrating the Mean as the Center of Balance of the Histogram
31. The Effect of Skewed Distributions on the Mean and Median
32. Two Distributions Having the Same Range but Different Variability
33. Probabilities in Coin Tossing
34. Normal Curve and Z Scale Relationship
35. Portions of Area of Normal Distribution Contained Between Given Z Scores
36. Probability of a Person’s Having a Given IQ
37. Standard Error of an IQ Illustrated
38. A Test Profile that Includes an Indication of Standard Errors of the Scores
39. Illustrative Scattergrams for Different Degrees of Correlation
40. Sample Items from Readiness Tests
41. Types of Guided Response Items Used to Measure Various Dimensions of Reading Ability
42. Grading Sheet from the College Entrance Board General Composition Test, Form F
43. Examples of Guided Response Items Designed to Test a Pupil’s Knowledge of the Mechanics of Composition
44. Examples of Guided Response Items and Techniques for Measuring Spelling Ability
45. Two Examples of Guided Response Items Keyed to Rhetorical Skill
46. How Evaluative Standards Seem to Operate for English Composition
47. Samples from the Ayres Handwriting Scale
48. Examples of Guided Response Items for Measuring Literary Knowledge
49. A Profile Reporting Form for English or Language Arts
50. Two Usual Types of Guided Response Items Used in Science
51. Profile Page of California Test of Mental Maturity
52. Illustrative Guess-Who Items Used to Measure the Reputation Pupils Have with Their Peers
53. Simplified Illustration of the Compilation of a Sociogram
54. Illustrative Sociogram of Friendship Patterns
55. Example of a Form to Report on Citizenship Items
56. Example of a Legal Evaluative Standard
57. Examples of Guided Response Items Used in Getting Pupils to Report on Their Own Interests and Attitudes
58. Examples of Guided Response Items Used to Measure Pupil Opinion
59. Examples of Three Types of Devices Used in Rating Personality-Character Variables after Observation



SECTION I 


FUNDAMENTAL CONCEPTIONS 
AND PROCEDURES 


OVERVIEW 

In this first section of the book we wish to discuss certain conceptions and 
procedures that are basic to measuring and evaluating pupil achievement and 
other things of importance in the schools. As you will find, all the many
principles and processes described are interrelated and pertain to a given definition
of measurement and evaluation. In our overview of the first section we wish 
briefly to state this definition and then to explain the order and significance 
of the nine chapters in the section. 

By now, all of us have some idea about the nature of measurement,
particularly the measurement of our physical environment. How many times have
we measured temperature by reading a thermometer, measured length by using
a ruler, measured the speed of a car by reading a speedometer, and measured
time by looking at a clock? Moreover, we have had countless experiences with
evaluation in our everyday activities: judgments about this dress or suit as
compared with that one, decisions as to party and candidate in elections, and
the continual evaluations we make of our acquaintances. Then, too, in our
years of schooling we have been subjected to innumerable measures of progress
in school subjects, and each quarter or semester we have had our achievement
evaluated by our teachers in terms of A’s, B’s, and F’s. So we should have
some notions about the measurement and evaluation of educational things as
well.

With all this background of experience concerning measurement and
evaluation, each of us should be able to define them. So let us try it now. Write
down your definitions of measurement and of evaluation. Then compare your
definitions with those of your fellow students. Notice how they vary, how some
are generalized and others particularized, how some omit what seems important
to you and how others seem to include nonessentials. From our observation
of the definitions submitted by hundreds of college students at the beginning
of a course in educational measurement, we will guess that your definitions
vary so greatly that you well may wonder what you are going to study.

Consequently, in the interest of clear communication, we ask you now to 
abandon your personal definitions and accept the one we offer. Like any defini- 
tion, ours is arbitrary but it is designed to cover the activities that we cus- 
tomarily call measurement and evaluation and it is further designed to make 
their study a rational and systematic process. We think you will find that the
definition covers yours but may go beyond it or may require some change in 
your perspective. 

Definition of Measurement and Evaluation 

1. Measurement is the process of assigning symbols to dimensions of 
phenomena in order to characterize the status of a phenomenon as precisely 
as possible. 

2. Evaluation is the assignment of symbols to phenomena in order to
characterize the worth or value of a phenomenon, usually with reference to 
some social, cultural, or scientific standard. 

To illustrate this definition, let us consider the case of a student in a tenth- 
grade typing class. Along with other students, he was given a five-minute speed 
test and was found to type fifty words per minute with a total of seven errors. 
This is the process of measurement. The phenomenon in question was typing. 
The dimensions to be measured were speed and accuracy. The status of the 
boy’s typing relative to speed and accuracy was precisely characterized by the 
symbols, fifty words per minute and seven errors.

Next, the instructor consulted a manual that gave the degree of speed and
accuracy to be expected of students in the tenth grade after given amounts of
instruction and accordingly assigned the student's test a grade of B. This is
the process of evaluation. The symbol B characterized the worth or value of
the student's speed and accuracy in typing. The B, meaning “good,” was
assigned because the boy’s speed and accuracy exceeded to a certain degree
what was to be expected, the standard.
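
The two steps of the typing illustration can be sketched in a few lines of Python. The sketch is offered only as an illustration of the distinction: the names used are our own, and the numerical standard in particular (the speeds and error limits attached to each grade) is hypothetical, since the instructor's manual is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class TypingMeasurement:
    """Symbols assigned to two dimensions of the phenomenon 'typing'."""
    words_per_minute: float  # dimension: speed
    errors: int              # dimension: accuracy


def measure(total_words: int, minutes: float, errors: int) -> TypingMeasurement:
    """Measurement: characterize status as precisely as possible."""
    return TypingMeasurement(total_words / minutes, errors)


# HYPOTHETICAL standard: (minimum speed, maximum errors, grade).
# These cut-offs are invented for illustration; they come from no manual.
STANDARD = [
    (60, 5, "A"),
    (45, 8, "B"),
    (35, 12, "C"),
    (25, 16, "D"),
]


def evaluate(m: TypingMeasurement) -> str:
    """Evaluation: characterize worth by comparing status with a standard."""
    for min_wpm, max_errors, grade in STANDARD:
        if m.words_per_minute >= min_wpm and m.errors <= max_errors:
            return grade
    return "F"


# A five-minute test in which 250 words were typed with seven errors.
status = measure(total_words=250, minutes=5, errors=7)
print(status)            # the measurement symbols
print(evaluate(status))  # the evaluation symbol
```

The measurement step makes no reference to any standard, while the evaluation step cannot proceed without one. Changing the hypothetical cut-offs would change the grade but not the measurement.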

Plan of the Section 

The first nine chapters of this book are essentially an expansion of this 
definition of measurement and evaluation, together with its corollaries. The 
symbols used in measurement are first discussed in Chapter 1. There we
will find that measurement symbols are of three basic types: those indicating
scale position, those indicating rank position, and those that simply denote a 
classification or constitute a description. The characteristics of measurable 
dimensions are discussed in Chapter 2, together with how dimensions may be 
construed in order to make them more measurable. 

In our definition of measurement we stated that measurement is a process.
This implies that certain procedures are used, and, in Chapters 3, 4, 5, and 6,
these procedures are discussed as they relate to behavioral measurement. There
we find that measurement symbols may be assigned to dimensions either
directly by the observer or indirectly through means of some instrument,
technique, or device. The construction of these instruments and devices is
discussed in detail.

Chapters 7 and 8 develop certain statistical concepts and operations that 
are important for the measurement process. Finally, in Chapter 9 the activity 
of evaluation is analyzed, together with the problems of marking and reporting 
grades. In this chapter, a great deal of attention is given to the development 
and use of evaluative standards. 

In expanding the basic definition, the nine chapters not only describe how 
measurement and evaluation are performed but they also establish the variables 
that affect their validity and efficiency. These are:

1. The nature and appropriateness of the symbolic forms used to char- 
acterize dimensions (Chapter 1). 

2. The relative “measurability” of the dimensions appraised and the suit- 
ability of these dimensions to the nature of the phenomenon and the purpose 
of measurement (Chapter 2). 

3. The aptness of the procedures used to achieve the assignment of sym- 
bols to phenomena (Chapters 3, 4, 5, and 6). 

4. The appropriate application of statistical concepts and use of statistical 
techniques to recast the measurement data into interpretative form (Chapters 
7 and 8).

5. The appropriateness and validity of standards selected for evaluation
and the efficiency with which status is compared with standard (Chapter 9).

This, then, is a prospectus of the first section. It can also serve as a 
summary and might well be reviewed when the section is concluded.




CHAPTER 1


FORMS OF MEASUREMENT SYMBOLS 


The end result of measurement is the most familiar and obvious of its
several aspects. This is the symbol or symbolic expression used to characterize
the status of something. All around us, in newspapers, magazines, over the
radio, the television, and in conversations, we see and hear numbers and
other symbols expressing the results of measurements: 60-foot frontage, 1240
kilocycles, 80 degrees, 25 miles per hour, 125 pounds, 85 cents per dozen,
3:30 P.M., and 2.3 decibels. We are more familiar with the symbols of
measurement than any other aspect of the process simply because they are what is
communicated. After a scale has been used or a test applied, we are no longer
concerned with it, but only with the measurement it accomplished. It is this
symbol that we write down and talk about and use.

In this chapter we shall analyze the symbolic expressions of measurement,
not only in order to understand them but also in order to gain some initial
insight into the measurement process itself. First, we shall illustrate the several
types of symbolic expressions used by man in his efforts at measurement;
following this, each form will be subjected to more careful scrutiny and
compared with the others. Finally, the different forms of measurement will be
viewed as a whole in terms of their significance for educational measurement.

Types of Measurement Symbols in General 

We have no precise historical record of the first forms of measurement
used by man. We can, however, speculate that one of the earliest attempts at
measurement made by our ancestors was in connection with determining “how
many.” This problem must have arisen early in the cultural development of
man, just as soon as there was concern about how many sheep or cattle were
possessed, how many warriors were in a given tribe, how many days had
passed, and how many fish to catch in order that no one in the group would
go hungry.

According to archaeological findings, we have good reason to believe that
the earliest expressions of “how many” were symbols meaning “few,” “many,”
and possibly symbols for “one” and “two.” Certainly, in some of the primitive
cultures existing today, the question of “how many” is answered only by terms
that mean “one,” “two,” and “many.” At the same time or possibly a little later,
man developed symbolic equivalents of “more than” and “less than” in order
to express his observation that some collections contain more objects or fewer
objects than other collections.

Later in the development of man’s conception of number, fingers, pebbles,
and notches in wood were used as symbols for indicating the quantity of
objects in a collection. Tally marks soon became systematized into primitive
number systems, and finally, today we have the Hindu-Arabic number system
that can represent “how many” in any situation and to any desired degree.

This sketch of how man has devised measures for one important dimension, 
quantity, serves to illustrate the basic forms of measurement symbols he has
developed and continues to use. 

1. Symbols that classify or describe. The symbols “few” and “many” are
of this type. They indicate simple categories or classes of quantity, and
collections are characterized by the categorical term assigned them.

2. Symbols that indicate rank or order. “More than,” “less than,” “largest,”
“next largest,” and “smallest” are examples of measurement symbols that
indicate rank. Here collections have been compared and designated in terms
of their differences in quantity.

3. Symbols that indicate a scale position. The symbols of this type used
in measuring “how many” are the numbers of the Hindu-Arabic system such
as “315,” “27,” “9.” The numbers characterize collections as to their quantity
by stating where they belong on a counting scale.

It is not necessary to rely on history to see the use of these three types
of measurement symbols. For instance, we might observe a child’s growing
conception of temperature. His first measurements are usually “hot” and
“cold.” Later on he may add further classification symbols such as “warm”
and “cool.” Soon he is able also to rank objects on the basis of their
temperature by using such rank or comparative symbols as “hottest,” “warmer than,”
and “coldest.” The final stage is reached when he is able to report the
temperature of an object by use of a symbol that indicates a position on the temperature
scale, e.g., 87 degrees or 62 degrees Fahrenheit.

Another example is provided by a child's development of expressions for
speed. At first it is simply “fast” and “slow,” measurement symbols that
classify or describe. Then follow the symbols “fastest,” “slower than,” and
“slowest,” measurement symbols that indicate comparisons of rank or order.
Finally, the child can express with understanding the speed of an object in
miles per hour, a measurement symbol that indicates a scale position.

While numerous other examples could be provided, both from the developing
concepts of children and from science as a new phenomenon is studied,
these should suffice to illustrate the three basic forms of measurement. Now
we need to examine each type in detail.
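
The three forms can be placed side by side for the temperature example. The following is a minimal sketch added for illustration: the readings are invented, and the cut-off points that separate “hot,” “warm,” “cool,” and “cold” are arbitrary assumptions rather than anything fixed by the discussion above.

```python
# Readings in degrees Fahrenheit for four objects (invented data).
readings = {"soup": 150.0, "bath": 98.0, "room": 70.0, "ice water": 33.0}


def classify(temp_f: float) -> str:
    """Classification symbol. The cut-off points are arbitrary assumptions."""
    if temp_f >= 120:
        return "hot"
    if temp_f >= 85:
        return "warm"
    if temp_f >= 55:
        return "cool"
    return "cold"


def rank(values: dict) -> dict:
    """Rank symbol: 1 = hottest. Depends only on comparisons within the group."""
    ordered = sorted(values, key=values.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


ranks = rank(readings)
for name, temp_f in readings.items():
    print(f"{name:10s} class={classify(temp_f):5s} rank={ranks[name]} scale={temp_f} F")
```

In this sketch the two coarser forms are derived from the scale reading. The reverse is not possible: neither “warm” nor “second hottest” is enough to recover the reading in degrees.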

Measurement Symbols That Express a Precise Scale Position 

We shall discuss first the symbols that express a definite scale position,
since this form is most commonly associated with the word “measure.” The
examples provided at the beginning of this chapter (60-foot frontage, 125
pounds, 85 cents, 80 degrees) are all symbols that express a scale position.
Each time we read a ruler, a clock, a speedometer, a weight scale, a
thermometer, or count objects, we are finding a measurement symbol that expresses
a scale position.

A scale consists of a series of units, each of which represents an exactly
defined portion of or degree of variation in the dimension being measured. It
is, of course, preferable to have the unit universally standardized so that it
may be more widely used. In expressing “quantity” by an Arabic numeral, the
unit is considered to be the individual indivisible object that comprises the
collection being measured. In the case of “how many people,” the unit is a
“person” and in the case of “how many years,” the “year” is considered the
unit. We also know that a unit of weight is the pound, a unit of length is the
inch, and so on. In the expressions of scale positions, we always find the unit
included or at least inferred: “3 inches,” “3 dollars,” “3 seconds,” “3
pounds.” Ordinarily, the unit is constant throughout the scale and, if this is the
case, then the difference between any two adjacent scale positions is
everywhere the same. This idea is illustrated by the fact that the difference between
a length of 78 inches and a length of 75 inches is the same as the difference
between a length of 11 inches and a length of 8 inches.

Most of the scales that we have seen in our everyday affairs have a zero
point, which ordinarily represents complete absence of the phenomenon being
measured. On our scale of “quantity,” zero would represent a collection
containing no objects. Also, on a weight scale, zero would represent the absence
of any weight, and, on a tape measure, zero would represent the absence of
length. Measurement symbols that refer to positions on scales that have zero
points can be added, subtracted, multiplied, and divided, and, furthermore,
ratios can be computed. For example, if an object weighing 28 pounds were
combined with an object weighing 14 pounds, the weight of both objects
together would be 42 pounds. In this case, the measurement symbols 28 and 14
can be added meaningfully and, furthermore, these two symbols can be
subtracted, multiplied, and divided, and the results will have meaning as far as
this particular scale is concerned. In addition, these two symbols can be set
into ratio with each other by stating that the second object weighs half as much
as the first object.

Several scales, although they have definitely established units, do not have
absolute zero points. The ordinary Fahrenheit and centigrade temperature
scales are examples of this type of scale. While each of these does have
something called a zero, these points are purely arbitrary and do not represent the
point of complete absence of temperature that is called absolute zero. Another
example is the scale of IQ's, a complete absence of intelligence having yet to
be found. Measurement symbols that refer to scales without absolute zeros
still can be added, subtracted, multiplied, and divided, but ratios should not
be computed. We do not say that 80 degrees Fahrenheit is twice as hot as 40
degrees Fahrenheit, nor do we think that a person with an IQ of 120 is twice
as intelligent as a person with an IQ of 60.
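
The caution about ratios can be checked numerically. The sketch below is an added illustration that uses only the standard conversions between the Fahrenheit, centigrade, and absolute (Kelvin) scales; the helper names and the pound-to-kilogram comparison are our own.

```python
def f_to_c(temp_f: float) -> float:
    """Fahrenheit to centigrade (Celsius)."""
    return (temp_f - 32.0) * 5.0 / 9.0


def f_to_k(temp_f: float) -> float:
    """Fahrenheit to Kelvin, a scale whose zero is absolute zero."""
    return f_to_c(temp_f) + 273.15


hot_f, cool_f = 80.0, 40.0

# The same two temperatures give a different "ratio" on each scale
# with an arbitrary zero, so the ratio describes the scale, not the heat.
ratio_f = hot_f / cool_f
ratio_c = f_to_c(hot_f) / f_to_c(cool_f)
# Only on a scale with an absolute zero is the ratio meaningful.
ratio_k = f_to_k(hot_f) / f_to_k(cool_f)

print(f"Fahrenheit ratio: {ratio_f:.2f}")
print(f"Centigrade ratio: {ratio_c:.2f}")
print(f"Kelvin ratio:     {ratio_k:.2f}")

# Weight has a true zero, so ratios survive a change of unit.
POUNDS_PER_KILOGRAM = 2.20462  # approximate conversion factor
ratio_lb = 14.0 / 28.0
ratio_kg = (14.0 / POUNDS_PER_KILOGRAM) / (28.0 / POUNDS_PER_KILOGRAM)
print(f"Weight ratio in pounds: {ratio_lb:.2f}, in kilograms: {ratio_kg:.2f}")
```

The apparent ratio of the two temperatures is 2 in Fahrenheit and 6 in centigrade, and only about 1.08 on the absolute scale. The ratio of the two weights, by contrast, is one half whichever unit is chosen.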

The crucial consideration for use of a scale form of measurement is, of
course, the existence of defined portions or of defined units of variation in the
dimension in question. When these are not present, scale measurement is
inapplicable, despite its obvious advantages. Unfortunately, many of the
dimensions of educational phenomena have no such defined units and hence may
not be measured by scale symbols, but instead must be appraised by the forms
that require no definite units, i.e., ranking, classification, and description.

Measurement Symbols That Express a Rank or Order Position 

To explain rank or order symbols, let us analyze a situation in which this
form of measurement normally is used. Suppose a sales manager wishes to
indicate the sales ability of one of the salesmen on his staff of fifteen salesmen.
He could say simply that he is the best salesman, or the worst, or near the top.
These terms are gross indicators of rank and are like the primitive and childish
expressions we discussed earlier. A more precise measure would require that
he arrange the fifteen salesmen in the order of their sales ability and then note
what numerical rank falls to the salesman in question. It would be possible
for the salesman to have a rank position ranging from 1, which is the top
position, down to 15, which is the bottom position. The number of his rank
would be the measure of the man’s sales ability relative to this particular group
of salesmen. Rank is useful as a measurement symbol in many other situations:
beauty contests, the judging of animals and exhibits at a fair, and, of course,
in school. It does not require units and a separate scale but only comparisons
among individuals or objects.

While in our example Arabic numbers were used as symbols of rank or
order, any set of ordered symbols can be used for this purpose: A, B, C, D,
or α, β, γ, δ. It is important to notice that these symbols have
little meaning when they stand alone. For example, if we were told that a
certain person had a rank position of 13 with regard to sales ability, we would
know very little unless we were also told the size of group in which the ranking
took place. A rank of 13 would be relatively high in a group of 432 but
relatively low in a group of 15.

In order to overcome this deficiency of rank numbers, percentile ranks
have been developed. These indicate the percentage of any group that lies
below the person in question. Hence, to assign a person a percentile rank of
78 would mean that he is better than 78 per cent of the group. This standard
way of expressing relative position is discussed in detail in Chapter 7.
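
The arithmetic behind the last two paragraphs can be sketched in a few lines of
Python. The function name percentile_rank is hypothetical, and the sketch
assumes that rank 1 is the top position and that no two persons are tied;
Chapter 7 gives the full treatment.

```python
def percentile_rank(rank: int, group_size: int) -> float:
    """Percentage of the group that lies below the person holding `rank`.

    Assumes rank 1 is the top position and that there are no ties.
    """
    if not 1 <= rank <= group_size:
        raise ValueError("rank must lie between 1 and group_size")
    below = group_size - rank  # everyone ranked lower than this person
    return 100.0 * below / group_size


# The same rank of 13 means very different things in the two groups.
in_large_group = percentile_rank(13, 432)
in_small_group = percentile_rank(13, 15)
print(round(in_large_group, 1), round(in_small_group, 1))
```

A rank of 13 places the person above nearly all of a group of 432 but above
only 2 of the 15 members of the smaller group, which is why the bare rank
number says little until the size of the group is known.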

Symbols that indicate relative position usually are numbers and they
attempt to serve the same basic measurement function as do the numbers that
indicate a scale position. However, they involve basic ideas and procedures
that are quite different from unit-scale numbers. Some of these differences are
cited below.





1. Rank symbols generally provide for only a limited amount of com- 
parison. One of the more obvious limitations of rank symbols is that they have 
meaning only in terms of the group in which the rank was established. If a 
student stands third highest in spelling in his class, this rank symbol has no 
meaning with respect to any other class or group of students. It is quite con- 
ceivable that the rank of this student’s spelling achievement would change 
considerably if he were placed in another group. On the other hand, scale 
symbols, which are based on standardized units, can provide for universal
comparison. A student’s height symbolized in inches provides a basis for mak- 
ing a comparison with the height of anyone else in the world. A rank symbol 
does not have this possibility. 

Steps can be taken, however, to offset this limitation of rank symbols. The
most obvious step is to extend as much as possible the range of the group in
which the ranking is done. Another step is to rank on the basis of a group
that is representative of a much larger group. Rank symbols based upon such
a standard group are called norms (see Chapter 7).

2. Rank symbols cannot indicate the extent of difference between adjacent
ranks. If a student ranks fifteenth in his class in arithmetic achievement, there
is no indication of how much better he is than the student whose rank is
sixteen, or how much poorer he is than the student with rank fourteen. On the
other hand, with scale symbols we can determine precisely the difference
between two students.

3. Rank symbols are limited as to what can be done to them mathematically.
In connection with scale symbols, we noted that they can be added,
subtracted, multiplied, and divided meaningfully. Rank symbols cannot be so
manipulated. Adding a rank of 3 to a rank of 5 would certainly be a
meaningless operation. In Chapter 7 we shall indicate what mathematical
operations can be performed with rank symbols.

4. Scale symbols may be converted into rank symbols but rank symbols
may not be changed to scale symbols. If we know a student's height in inches,
for example, we can easily find his rank in height among the other members
of his class. We need only to have the height in inches of each member of the
class and then arrange these in order of size and note the relative position of
the student in question. On the other hand, rank symbols cannot be changed
into scale symbols for the reason stated in 2 above.
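
Differences 2 and 4 above can be illustrated with a short Python sketch. The
pupil names and heights are hypothetical, and the helper ranks_from_scores
assumes that no two pupils share the same height.

```python
def ranks_from_scores(scores: dict[str, float]) -> dict[str, int]:
    """Rank 1 goes to the largest score; assumes no two scores are equal."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


# Hypothetical heights in inches for the same three names in two classes.
class_a = {"Ann": 61.0, "Bob": 60.5, "Cal": 52.0}
class_b = {"Ann": 70.0, "Bob": 58.0, "Cal": 57.5}

ranks_a = ranks_from_scores(class_a)
ranks_b = ranks_from_scores(class_b)

# Scale symbols convert to ranks without difficulty, and here both classes
# produce identical ranks.
print(ranks_a == ranks_b)

# The ranks alone therefore cannot recover the heights, nor the size of the
# gap between first and second place in either class.
gap_a = class_a["Ann"] - class_a["Bob"]
gap_b = class_b["Ann"] - class_b["Bob"]
print(gap_a, gap_b)
```

Because two quite different sets of heights yield the same set of ranks, there
is no way to work backward from a rank to a scale value.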

All these specific differences between unit-scale symbols and rank symbols
stem from one basic difference. Unit-scale symbols are based upon a unit of
measure, generally a standardized unit, while rank symbols of measurement are
not based upon any unit of measure but only upon an observed difference
between individuals or objects. The amount of this difference is unimportant
for determination of rank, but it is exact and critical for scale measurement.

Measurement Symbols That Classify and Describe 

The third form of measurement used by society again does not make use
of a unit of measure and yet it is extremely valuable in characterizing the status
of many phenomena. It involves simply an indication of the classification or
category to which something belongs. Model numbers are an example. By
saying that a certain car is a 1952 model, we indicate that it belongs to the
class of cars made in 1952. Similarly, we can speak of a B-47 airplane, a
model 3C242 centrifugal pump, and a model S62 office desk, all of which
symbols indicate the category to which the objects belong. Other examples
of measurement by classification are the symbols used in Selective Service,
serial numbers and file numbers, the Dewey decimals on library books, and
various job classifications, e.g., GS-5 in the federal civil service.

Each classification symbol ordinarily represents a description, drawing,
picture, or set of specifications that portrays the distinguishing characteristics
of the persons or objects comprising the classification. In Selective Service
classification, II-A means:

In class II-A shall be placed any registrant found to be a “necessary man” in
any industry, business, employment, agricultural pursuit, governmental service, or
any other service or endeavor, or in training or preparation therefor, the
maintenance of which is necessary to the national health, safety, or interest in
the sense that it is useful or productive and contributes to the employment or
well-being of the community or the nation.¹

In the same manner, the classification symbol “1952 model” for a particular
automobile represents a picture or outline drawing of the car and a description
of such characteristics as the over-all dimensions, number of cylinders, type
of carburetion, shock absorption, engine mounting, water cooling, brake
action, and grillwork.

Often, instead of letters or numbers, words or short phrases may be used as
classification symbols. A good example is found in the words often used by
the Weather Bureau to express the status of wind. These are: calm, light
breeze, gentle breeze, moderate breeze, fresh breeze, strong wind, gale, and
hurricane. Each term symbolizes a descriptive category. For example,
“moderate breeze” stands for the following description: “raises dust and loose
paper, small branches of trees are moved and the leaves and small twigs are
in constant motion, extends a heavy flag.”

As a further example, consider a teacher who wishes to appraise pupils
with regard to acceptance of authority. He could use, and many teachers have
used, words such as these to represent categories of variation regarding this
dimension: defiant, sullenly compliant, ordinarily obedient, respectful, and
completely docile. Each of these classifications needs to be described in detail
so that the symbols will have a specific meaning. What the teacher uses in this
case is generally called a behavior-rating scale. This device is discussed in
detail in Chapter 4, pages 54-59.

From our discussion, it is apparent that the role of the class symbols
themselves is purely nominal; they simply serve as convenient shorthand
expressions. The crucial part of this form of measurement is the description of
the various categories. For effective classification, the description should spell
out as completely as possible the distinguishing characteristics of the category.
This should include, if necessary, the typical situations in which these
characteristics are observed. Secondly, the description should clearly mark the
boundaries between adjacent categories. Ordinarily, this is done by drawing
contrasts between items belonging to different categories. These requirements
for the descriptions suggest that setting up a scheme of classifications for a
given phenomenon is not an easy task.

1 Selective Service Regulations, Volume III, Classification and Selection
(Washington, D.C.: U.S. Government Printing Office, 1940), p. 19.

It is important to note here that a classification scheme can deal with but
one dimension or aspect of a person or object, and that it provides only for
gross characterization as to this one dimension. The Selective Service
classification system, for example, is focused specifically upon eligibility for
service in the armed forces and it contains only a few categories of eligibility.
As a result, men who are quite different in many respects may find themselves
in the same category. An electrical engineer, a physician, and a dentist might
all be given a II-A classification. Moreover, each II-A might actually deserve
slightly different treatment by a draft board. Similarly, the classification scheme
for acceptance of authority, which was provided earlier, ignores differences in
honesty, intelligence, age, etc., and also the slight differences in acceptance of
authority that would necessarily obtain for all pupils classified as “ordinarily
obedient.”

Teachers often need a broader and more precise characterization of their
students than classification provides. This occurs particularly when the teacher
wishes to evaluate such things as study habits, citizenship, social adjustment,
etc. To classify pupils regarding these complex dimensions would simply not
help much. So the teacher must resort to describing the pupil with many words,
phrases, and sentences.

Description and classification are, of course, hardly separable. If we were
to consider that each individual in a group could constitute a category in
himself, then the description of the individual would be the description of the
category and the forms would be synonymous. In practice, though, we think of
classifying when the categories are much less numerous than the individuals
(as with selective service). Moreover, the words and numbers that symbolize
classifications are not essentially different from the words and numbers that
are used to describe. But again in practice we think of certain words more as
classifications (genius or moron) and others more as simple describers (highly
intelligent or very dull).

In description, the intent is primarily to identify, to distinguish, to
characterize an individual informally with respect to an unlimited number of
aspects or dimensions. In classifying, on the other hand, and in scaling and
ranking, we are more concerned with comparisons and appraisals with respect
to a single dimension or aspect. What is necessary for a valid category
description is likewise necessary for a valid description of an individual. It must
indicate the salient characteristics and the boundary of the phenomenon. Words,
letters, and numbers are all used in description as they are in classification, and
frequently some of the descriptive entries are expressions of rank, scale
position, or classification for components of the phenomenon. The form does
not facilitate comparisons among individuals, but it can characterize the
individual's unique status as no other form can.

Description is widely used by teachers and psychologists, both for the 
reason of its inherent validity for complex phenomena and because other forms 
are too often inapplicable. Application of the form is discussed in detail in 
Chapter 4. Here it may suffice to offer an example of a description as used in 
education. The following is quoted from a record of an interview made by one 
of the authors: 

. . . drew a picture of his family and discussed his feelings with the examiner.
The picture figures were stereotyped and more like those of an eight-year-old than
a twelve-year-old. They disclosed no particular family feelings or relationships. His
very rapid speech and his occasional stuttering both became accentuated when he
was questioned about his family. He answered questions about them readily and
his answers connoted defensive pride to the examiner. He volunteered that he used
to steal but stopped when his father whipped him. He thinks stealing is wrong.
When questioned further about what he had stolen and when, he flushed and
became incoherent in his relation. No details were elicited. He admitted that his
twin still steals.

Forms of Measurement Viewed as a Whole 

Up to this point we have discussed the various forms of measurement
symbols and the advantages and difficulties of each of them. Numbers, words,
and letters, singly and in groups, measure as they indicate scale position, as
they designate rank or order, and as they classify or simply describe. We have
treated the forms separately, though contrasting them, and in so doing we
have oversimplified the matter. We have acted as though a phenomenon to be
measured might, as a rule, be measured as a whole and by a single form. This,
of course, is seldom the case. Most phenomena of interest to educators have
varied aspects that require different forms of measurement. Consequently, to
measure these phenomena requires a combination of forms.

To illustrate, consider a boy’s “academic aptitude.” A very minimum 
measurement of this phenomenon involves some scale symbols: age twelve, 
IQ, 110, vision in both eyes 20/20. It entails some rank or order symbols: 
80th percentile in arithmetic in the 7th grade, has a reading age of 12-2. It 
necessitates some classification: he is a pupil and a “good citizen.” And, 
finally, only a description of his study habits may be given. 
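
The combination of forms in this example can be pictured as a single record
whose entries each carry their own form. The sketch below is only an
illustration: the Measurement class and the FORMS tuple are hypothetical
names, and each entry simply repeats the form that the paragraph above assigns
to it.

```python
from dataclasses import dataclass

FORMS = ("scale", "rank", "classification", "description")


@dataclass(frozen=True)
class Measurement:
    dimension: str
    form: str    # one of FORMS
    symbol: str  # the number, word, or phrase recorded

    def __post_init__(self) -> None:
        if self.form not in FORMS:
            raise ValueError(f"unknown form: {self.form}")


# Entries follow the text's own assignment of each symbol to a form.
academic_aptitude = [
    Measurement("age", "scale", "12 years"),
    Measurement("IQ", "scale", "110"),
    Measurement("vision", "scale", "20/20 in both eyes"),
    Measurement("arithmetic, 7th grade", "rank", "80th percentile"),
    Measurement("reading age", "rank", "12-2"),
    Measurement("citizenship", "classification", "good citizen"),
    Measurement("study habits", "description", "(free-text account)"),
]

forms_used = {m.form for m in academic_aptitude}
print(sorted(forms_used))
```

Even this minimal record draws on all four forms, which is the point of the
illustration.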

A choice among the several forms (scale, rank, classification, and
description) is more a function of the phenomenon to be measured and the
purpose of the measurement than it is of the forms themselves. Of course, it
pays to measure with numbers because arithmetic may be applied to them.
It is better to scale than to rank because scale numbers can be added,
subtracted, etc., while rank designations cannot be. And if comparison among
comparable phenomena is desired, description is of little help. But, beyond these
considerations, you should determine the forms to be used on the basis of what
you wish to measure, for what purpose you measure, and the evaluative
standards to be applied. The details of this interrelationship between form,
object, purpose, and evaluative standard are specified in the several chapters of
Section II.

Whatever symbolic forms are used, they need to meet certain conditions
if accurate measurement is to be accomplished. These are in addition to, or
perhaps in re-emphasis of, some of the conditions specified for the particular
forms.

1. The forms should be appropriate to the type of variation to be measured.
As we will see in the next chapter, educational phenomena manifest two
general types of variation. Some, such as intelligence, involve serial variation.
Pupils differ as to degree or amount and, consequently, either scaling or
ranking may be used because both express differences in degree or amount. The
second basic sort of variation is that of type or category. This is exemplified
by vocational and recreational interests. Here pupils differ as to the type of
work they like to do and the type of sport or game they prefer. Obviously,
some sort of classification or descriptive form must be used to characterize
such categorical variation.

2. They should exceed the range of possible variation in the thing being
measured. Without this, cases touching or exceeding the limits of the scheme
are unmeasured. An example is a ruler painted on the wall, starting at four
feet and running to seven. Some midgets, less than four feet high, and some
giants, more than seven feet tall, are of unknown height according to this scale.

3. They should cover the range of variation completely. Obviously, if a
gap is left, we do not measure with any precision a case that happens to fall
within the gap. This error is frequently made in dichotomous classification.
For example, if in a sociological survey we try to classify all persons as married
or unmarried, we will have difficulty with “common-law marriages” and even
more casual relationships in which children and parenthood are involved.

4. The symbols should have standard and limited meaning. Since most, if
not all, measurement involves or results in communication, it almost goes
without saying that our terms must mean essentially the same thing to all who
are to use them. Moreover, if we do not know where one symbol, say
upper-middle class, ends and another, lower-middle class, begins, we have a great
deal of trouble appraising those cases that fall near what should be the limit
of the symbol's meaning.

5. Finally, the appropriate form of measurement should be selected
without regard to preconceptions about the superior value of numbers.
Unfortunately for education, numerical measurement has a tremendous amount of
prestige. This is primarily because the nearly miraculous progress of the
physical sciences has been associated with the use of numerical measurements.
Many of us who practice education have come to believe that effective
measurement simply has not been accomplished unless numbers have been obtained.
Moreover, we are apt to get a certain feeling of security in dealing with
numbers. They are exact and objective, and furthermore pupils are not prone to
question numerical results.

Because of our identification of numerical measurement with scientific
measurement, some teachers, administrators, and even psychometrists and
psychologists have made serious errors in measuring pupil achievement. Too
often we have appraised only the more superficial aspects of pupil knowledge
and skill simply because they can be measured by numbers. We have counted
the number of pages a pupil has read, the number of homework problems he
has completed, the number of punctuation errors in his sentences, the number
of recitations he has made, and, of course, the number of true-false questions
he has answered correctly. This we have done, to the neglect of his
understanding of what he has read, what he has learned from his homework, the
smoothness of his sentences, the quality of his recitations, and the merit of his
reasons for answering true or false. The latter category of things demonstrably
is more significant than the former, but it is more difficult to measure with
numbers, and particularly with scale numbers.

We take the view that effective measurement is scientific and deserves
prestige, not just a scale form of measurement. As a matter of fact, the
“scientist” uses each of the other forms when the situation requires it. The
chemist describes the attributes of a new synthetic fabric. The physicist
classifies a type of radiation as beta or gamma. The engineer ranks building
materials according to their durability. The difference between the measurements
of the skilled teacher and the scientist is one of degree and not of kind.
Because the objects of the teacher's measurements are the more complex and
intangible, his measurements are the less precise. But when the scientist's
objectives are equally complex and intangible, his measures are equally imprecise.

We should consider that form of measurement to be optimum which most
precisely characterizes the essential dimensions of a phenomenon. In education,
this means that classificatory and descriptive words are often the optimum
forms because many of the dimensions we need to measure may not be validly
characterized by raw numbers alone.
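
Returning to conditions 2 and 3 in the list above, a scheme of symbols can be
checked for cases it leaves unmeasured. The Python sketch below is an
illustration under stated assumptions: the function classify, the half-open
intervals, and the score scheme with a gap are all hypothetical; only the wall
ruler running from four to seven feet comes from the text.

```python
from typing import Optional


def classify(value: float,
             scheme: list[tuple[float, float, str]]) -> Optional[str]:
    """Return the label whose half-open interval [low, high) holds value."""
    for low, high, label in scheme:
        if low <= value < high:
            return label
    return None  # the case is left unmeasured by this scheme


# The wall ruler of condition 2: it registers only heights of 4 to 7 feet.
wall_ruler = [(4.0, 7.0, "measurable")]

# A hypothetical scheme for test scores with a gap between 60 and 70,
# which violates condition 3.
gapped_scheme = [(0.0, 60.0, "low"), (70.0, 101.0, "high")]

print(classify(3.5, wall_ruler))      # below the range of the ruler
print(classify(65.0, gapped_scheme))  # falls within the gap
print(classify(85.0, gapped_scheme))  # covered by the scheme
```

A result of None marks a case for which the scheme has no symbol, either
because the case lies outside the range or because it falls in a gap.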

Summary 

The purpose of a measurement symbol is to characterize the status of that
which is measured. Forms in current use may designate scale position, rank or
order within a group, classification, or simply constitute a description.

The most precise of these forms is scaling, and scale numbers may be
manipulated mathematically. However, scaling requires a zero or other fixed
reference point and the existence of defined and constant units of difference.
Since there are no constant units or zero points for most educational
phenomena, scaling is seldom applicable. Rank and classification symbols may be
used without reference to units or zeros and, consequently, are widely applicable
in educational measurement. These two forms are less precise, though, and are
not amenable to addition or multiplication. Description is the necessary
recourse for many complex phenomena and, while descriptions are not easily
compared, they can constitute very exact appraisals.

The primary considerations in selection of a measurement form are the
objects of measurement and the standards by which they are to be evaluated.
In addition, any system of measurement symbols should satisfy these
conditions: (1) be appropriate to the type of variation to be measured; (2) exceed
the range of variation in the things being measured; (3) cover this range
completely; and (4) have standard and limited meaning.

In the final analysis, the measurement symbols used in education should
fulfill their intended function of characterizing the status of complex
educational phenomena. Because of the nature of these phenomena, more reliance
should be placed upon a combination of rank symbols and classificatory and
descriptive expressions without regard to the superior “prestige” of numerical
measures.


EXERCISES

1. Select a variable phenomenon such as sound, color, slope, or cost of living
and give examples of each of the three types of symbols that may be used in its
measurement.

2. The statement is made: “The sky is blue.” Is this measurement? Explain your
answer.

3. If a phenomenon can be measured by a classification scheme only, what
can be said about the phenomenon?

4. Identify the categories of a classification scheme for such phenomena as
fair-mindedness, honesty, perseverance, study habits, muscular co-ordination, and
social adjustment.

5. Susan has a vocabulary of ^00 French words and Sally has a vocabulary of
700 French words. What type of measurement symbol is used? What can be said
about the difference in French vocabulary between Susan and Sally? Explain your
answers.

6. For each of the following measurement symbols often encountered in
education, identify the type of symbol used and justify your answer.

IQ 105 on Stanford-Binet test

85 per cent on a teacher-made arithmetic test

Spelled 15 out of 20 words correctly

Strength of grip, 15 pounds

Received 113 votes in a school election

Mental age eight years, six months

Fifty-seventh percentile on a standardized test

Sixth-grade reading level

Received a B in the course

Grade point average or honor point ratio is 2.87





CHAPTER 2 


PREPARING PHENOMENA FOR MEASUREMENT 


In the chapter just completed we examined educational measurement from
the viewpoint of the symbolic expressions we assign to things when we measure
them. During this examination we discussed three symbolic forms:
classification-description, ranking, and scaling. We discovered that they differ as to
precision and significance and thus that the form of symbol to be used affects
the adequacy of the measurement. So it is with the phenomena themselves, the
subject of the present chapter. They differ widely and in their variation they
affect the efficiency of the measurements made of them. Some are relatively
easy to measure, others difficult. Some lend themselves to direct measurement,
others only to indirect procedures. Some permit precise appraisals and many
others, of course, are susceptible only to very gross measurement.

In our study of measurable phenomena we first shall review some of the
problems posed by the array of educational things for which measurement may
be desired. Then attention will be given to the way in which phenomena must
be construed if they are to be measured. Several basic attributes of measurable
dimensions will be cited. We shall discuss inferred dimensions and the
dimensions of phenomena that constitute constructs. Some attention will be given
to issues relevant to the selection of dimensions for measurement. Finally, we
shall discuss the basic dimensions of pupil achievement in school subjects.


Educational Phenomena for Which Measurement May Be Desired 

We use the word “phenomena” as a collective symbol for the possible
objects of measurement because it is about the only word sufficiently general
to encompass all the various things that teachers and other educational
personnel say they wish to measure. The following list suggests the great variety
of these phenomena:¹


Ability (in)
    Abstract thinking
    Art
    Music
    Tennis
    Etc.
Achievement (in)
    Arithmetic
    Social studies
    Etc.
Adjustment
Aggressiveness
Aptitude (in)
    Clerical tasks
    Engineering
    Physical education
    Etc.
Attitudes, in general and toward
    Democracy
    Minority groups
    School
    Etc.
Character
Citizenship
Community attitudes
Creativeness
Critical thinking
Developmental age
Dexterity
Effort
Gain
Group structure
Handwriting
Interests (in)
    Play
    Reading
    Vocations
    Etc.
Intelligence
Knowledge (of)
    Chemistry
    Contemporary affairs
    History
    Literature
    Etc.
Leadership
Mental age
Personality
Problem-solving ability
Readiness (for)
    Arithmetic
    Reading
Skill (in)
    Cooking
    Typing
    Woodwork
    Etc.
Speech
Study habits
Understanding (of)
    Geography
    History
    Physical principles
    Etc.
Vocabulary

1 Derived from indexes of books on educational measurement and from the
experience of the authors in teaching and in working with schools on
measurement problems.


The phenomena that educators wish to measure constitute an unsystematic
array and they have several characteristics that make for difficulty in
measurement. For one, the items are neither mutually exclusive nor collectively
inclusive of all that may be significant. For example, skill, ability, and
achievement each may refer essentially to the same thing, and so with aptitude and
readiness, and with knowledge and understanding. Yet the uninitiated may
think that they refer to different things and may try to measure them separately
with foredoomed frustration. Then, wherever items are free of overlapping,
they often do not cover all the ground they should. To predict a pupil's success
in Grade IX science, we might think we need only to measure his intelligence,
his previous knowledge, his attitude toward school, and his study habits, since
these are the things usually measured in predictions of school success. Yet we
could appraise these very precisely and still not predict success in Grade IX
science with any great accuracy because there are additional things involved
in school success: strength of motivation, efficiency of instruction, adequacy
of materials, etc.

A second troublesome characteristic of the array is that too many of the
items are high-order abstractions having no clear, agreed-upon definition.
Character, citizenship, knowledge, personality, and intelligence, for example,
have nearly as many definitions as there are tests for them and the definitions
too often are in terms of equally abstract words. We shall find in a few pages
how important is clear definition for measurement and how difficult is the
measurement of abstractions.

In the third place, measurement is used for a variety of educational
purposes: marking, programing, prediction, diagnosis, research, administrative
planning, curriculum evaluation, public relations, etc. Frequently a given
purpose may involve a special point of view toward a phenomenon and, if
several points of view exist for a single phenomenon, there may be confusion
in its measurement. To illustrate this confusion, consider the measurement
of reading. A teacher needs to give fifth graders a mark in reading; so she
measures them to see how well each has learned to read the materials he has
used. A junior high counselor needs to assign students to different English
classes; so he measures the students to see how they vary in reading ability. A
remedial teacher must help cases of reading retardation; so he measures to
see just what are each pupil's reading difficulties. Finally, the superintendent
wishes to impress the public with the excellence of the district's reading
program. Consequently, he measures to see how far the mean² reading ability
of each grade is above some norm³ for the grade. Because of their varied
purposes, each of these persons is likely to use different tests of reading and
to express the results of his measurements differently. Yet each may consider
that he has measured the same thing as the other: reading.

Finally, most of the phenomena are behaviors, some are covert behaviors,
and a few are terms for inferred states of mind or emotion that underlie
behaviors. As such, they are processes, not things. They change; they occur
and then they do not occur; often the act of measurement distorts them; and
they have few to no exact physical properties. Consequently, the practice of
repeated measurement under the same conditions that gives the physical
sciences their exactitude is largely obviated in education by the very nature
of the phenomena needing measurement.

Because of such characteristics, it is apparent that educational phenomena
must be carefully examined with a view to their measurability. Many of them
will need to be redefined or otherwise prepared before measurement is
attempted.


The Nature of Measurable Dimensions 

It is axiomatic that we have to measure things in terms of their aspects,
properties, qualities, or dimensions. We say a man is 6 feet tall, is 170
pounds in weight, and is a pinkish color. If we omit the dimension designation
(just say that he is twenty-seven), we count on the listener
to infer what we omitted, probably that his age is twenty-seven years. The
names of these things relative to which we measure phenomena are many:
factors, variables, properties, conditions, parameters, qualities, dimensions,
etc. For convenience we shall use the single word dimension. Anything that
we measure a phenomenon “in terms of” or “with respect to” we shall call a
dimension.

2 For explanation of mean, see pages 144-149.

3 For explanation of norm, see pages 158-160.

If we measure phenomena only in terms of their dimensions, it follows
that we can measure them only to the extent that we have identified their
dimensions and only to the extent that these identified dimensions lend
themselves to measurement. Viruses constitute a case in point. When viruses first
were hypothesized by physicians as the causes of disease, they were hardly
more than a word for “cause unknown,” since only the property of “causing
a symptom” or “not causing a symptom” was established for them. Appraisals
of their status were conjectural, vague, and subjective. It was only as such
definite dimensions as “size of filter through which they might not pass,” “rate
of growth,” “preferred culture,” “chemical make-up,” etc., were indicated
that physicians could say it was possible to measure viruses.

It should be noted that there may be only a relative distinction between
a phenomenon on the one hand and a dimension on the other. Certain things
may constitute the dimensions of a given thing; but when they are viewed
just for themselves, they in turn have dimensions. For example, to determine
the status of a school pupil we need to measure his age, height, weight,
knowledge of school subjects, intelligence, personality, background, etc., such
factors being basic dimensions of the pupil. But we may wish to measure his
intelligence alone and to do so we identify and appraise its dimensions, such
things as memory, perceptual discrimination, vocabulary, and reasoning ability.

Dimensions are considered to be measurable; that is, capable of being 
described, classified, ranked, or assigned a scale number, to the degree that 
they approximate five basic conditions. As will be apparent, some educational 
phenomena already possess identified dimensions that meet the conditions, but 
others do not. In the latter case it will be necessary to recast the dimensions 
so that they do meet the conditions. In some instances, entirely new dimensions
may have to be designated. 




1. Measurable dimensions are common to a group or class of things.
First, as a matter of logic we should notice that a measurable dimension is
[relevant to a group] and not to an individual only. Obviously, all the myriad
[dimensions of interest] to educators pertain to many if not to all pupils.
Weight, learning ability, motivation, etc., are exhibited in varying
[degrees b]y all pupils. But suppose we had a dimension absolutely unique to
[one individual. How could we mea]sure it? Suppose John was found to have a
property that no one else ever had. If we are to measure it, we must assign
[a symbol that] classifies or describes, ranks, or shows a scale position.
[Can we show] John's status with respect to the property by classifying? Only
[into two] classes: pupils with the property and those without it. Can we
[describe it?]






PREPARING PHENOMENA FOR MEASUREMENT 




That requires the use of descriptive words and/or numbers, and the very
nature of words and numbers is that they refer to previously encountered
properties or dimensions. Can we rank the boy with respect to the property?
No, we have only one case. Finally, can we measure the thing in terms of a
scale value? The existence of a scale requires the existence of previous
phenomena from which the scale was derived, so we could have no scale.
Apparently, then, we can appraise John’s unique property only to the extent of
saying that he has it and that no others do, which is no more than we knew
when we started to try to measure it.

To reassert the condition, measurable dimensions are necessarily relevant 
to a group and not to one case or one individual only. 

2. A measurable dimension must provide sensory data. A second important
characteristic of measurability is demonstrated by the behavior of persons
who engage in actions called measuring. A carpenter measures the length of
a board by looking, to see what line on his jointed rule is at the end of the
board. In judging the range of a pupil’s voice, a music teacher listens, to
determine which piano notes are like the highest and the lowest sounds the
pupil can make. To measure the gap in a spark plug, the mechanic feels which
of several thin leaves or wires will just fill the gap. We might go on to call
on a cosmetician to appraise the strength of a perfume through smelling and a
gourmet to determine the amount of garlic in a pot of soup through tasting.
And even when the teacher measures his pupils' knowledge of history through
the device of a true-false test, he finally looks, to see which questions are
answered correctly and which incorrectly.

It is apparent from these examples that measurement entails some kind
of sensory data; some person has ultimately to receive some sensation to
accomplish the measurement.* So, to be measurable, a dimension must provide
some sensory data. Moreover, the more discriminating are the senses aroused,
the more the dimension lends itself to direct measurement. For example, we see
things more discriminately than we feel things. Thus, we measure height
directly by looking at the pupil and the ruler, but we must measure weight
indirectly through looking at a dial or a balance because the kinesthetic
sensations by which we feel weight are too indiscriminate to provide a
reliable measure. Finally, the more clearly differentiated are the sensations
provided by the dimension, the more precise is likely to be its measurement.
We are apt to be more accurate in counting the pupils in a given class than we
are in judging how bright is the light in the classroom. The difference
between six pupils and seven pupils is easy to see, whereas the difference
between ten foot-candles of illumination and eleven foot-candles is less
distinct.

* [In such] devices as automatic pilots and gun directors, many measurements
are made and are not observed by a person but, rather, by another part of the
device. However, in these cases you may consider that a part of the device is
an “observer,” a substitute for a person, or that the pilot or gun chief could
see, hear, or feel something if he wished to or that he could attach
instruments whose movement he could observe.






In education, the problem of sensory data is particularly acute. Such things
as speaking, singing, and studying can be observed but, being events, they
ordinarily provide only nonrecurrent sensations. Such things as knowledge
of history, understanding of science, intelligence, and attitudes toward school
cannot be observed at all, so for them it is necessary to devise observable
dimensions that constitute evidence of their unobservable dimensions.

3. To be measured a dimension must be clearly defined. This third essential
condition of measurable dimensions is self-evident. If to measure means
to assign a symbol that properly characterizes the status of a phenomenon with
respect to some dimension, it follows that the dimension must be clearly
defined. Indeterminateness in the dimension necessarily means
indeterminateness in status with respect to that dimension.

In education the clear definition of dimensions is particularly critical since
the phenomena with which we are concerned too often are inferred states that
derive as much from the viewpoint of teachers and psychologists as they do
from the pupils themselves. Moreover, the language used to denote and define
the dimensions of such things as subject achievement, intelligence, study
habits, and citizenship usually is of the vernacular; the words then have
broad and varied connotations. For example, these might be the dimensions of
achievement as defined in a general science course of study: factual
knowledge, critical mindedness, habits of accuracy, understanding of
relationships, and scientific attitudes. It is exceedingly difficult to know
exactly which of the many possible meanings for any of these words is
intended, let alone to differentiate one dimension clearly from any other.

So in arranging educational phenomena for measurement, an essential first
step often must be to define their dimensions in unequivocal terms. As we shall
see, this generally means to define them in terms of simple and observable
actions and the observable attributes of the actions. For example, “critical
mindedness” might need to be restated as “points out errors in evidence or in
arguments from evidence.” We can listen to or read the errors a pupil detects
and we can count how many and identify what kind.

4. If a group of phenomena are to be measured with respect to a dimension,
they must differ with respect to the dimension. Variation is inherent in
the act of measurement. In fact, dimensions often are called variables. We
measure the intelligence of pupils because they have different degrees of
intelligence. We measure their achievement in social studies because they
exhibit different degrees of achievement.

If all members of a group are alike with respect to some property, its
measurement has no significance. For instance, all pupils in attendance at
school are alive. We are aware of no differences among them as to their degree
of sentience, so teachers do not customarily measure how alive they are.
Physicians and undertakers are, of course, concerned with tests for
“livingness” because they inevitably see those who are dead.

We are most prone in measurement to think of the differences among 





phenomena as lying along some sort of a continuum or of constituting a
continuous variable. Intelligence, height, weight, subject achievement, motor
skill, are examples of dimensions in which variation is thus a matter of
degree. There are, however, other types of variation. We have dimensions in
which variation is discontinuous. Enumeration (how many) is one of these and it
pertains to anything that has parts or distinct units. For other dimensions the
variation is not serial or a matter of degree; it is simply one of type or
category. For example, there are dichotomous dimensions such as sex and
many-classed ones such as race, nationality, and political affiliation. The
variation in some categorical dimensions has to do with place in a hierarchy
or other systematic arrays; for example, the “species” of a life form and the
“somatotype” of a person’s physique. While such instruments as rulers and
meters, which yield measurements along a continuous scale, are not applicable
to these other types of variation, procedures that record the differences in
appropriate ways are applicable, and the dimensions are measured only as the
differences are detected.

The fourth essential condition of what we can measure is, then, variability.
Dimensions exhibiting the more easily discernible and the more consistent
differences are, of course, the more precisely measurable.
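
As a minimal sketch of this fourth condition, the check below asks whether a
set of recorded values shows any variation at all before measurement is
treated as meaningful. The function name and the sample records are
hypothetical illustrations, not a procedure taken from the text.

```python
def shows_variation(values):
    """Return True when at least two of the recorded values differ."""
    return len(set(values)) > 1


# Hypothetical records for one class of four pupils.
alive = [True, True, True, True]        # every pupil in attendance is alive
reading_rate = [180, 240, 300, 210]     # words per minute, differs by pupil

print(shows_variation(alive))           # False
print(shows_variation(reading_rate))    # True
```

A dimension on which every member of the group has the same status, such as
being alive, fails the check; one on which pupils differ passes it.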

5. Measurable dimensions must produce highly similar reactions among
many unrelated and impartial observers. Finally, have you ever tried to
measure a ghost? Some persons say they have seen and heard them. They are
defined rather clearly by these persons and they are said to differ with
respect to size and temperament. The British Society for Psychical Research
has for some time been attempting to measure ghosts, but the results of the
measurements are not yet satisfactory even to many members of the Society.

On the other hand, have you ever tried to measure a motion-picture ghost
cast on a screen by a projector? If you were to try, your appraisal as to
height, luminosity, etc., probably would be accepted, as long as you used a
rule, a light meter, etc.

Now why is it that projected images of movie ghosts can be measured and
ghosts themselves cannot be? They are both observed by some persons. They
are both rather clearly defined, and they both exhibit variation. You say a
ghost isn’t real, that ghosts are just superstitions? We agree. But suppose
99.9 per cent of all the people are blind and only 0.1 per cent can see. If
these 0.1 per cent see projected images and talk about them to the blind who,
being an overwhelming majority, publish all the scientific journals, might not
projected images also be thought a superstition?

So, what essential characteristic does distinguish the ghost from the
projected image as to measurability? It is thought to be simply that many
unrelated and impartial observers agree essentially as to what they observe
when they observe a projected image; and, of great importance, all, or nearly
all, persons with 20/20 vision do see something when an image is projected.
With ghosts, however, only related observers agree even a little, unrelated
ones disagree a great deal, and impartial observers generally never see
anything except where the ghost is found to be a projected image or an optical
illusion.

This final condition of measurability for a dimension, that impartial
observers must agree in their reactions, has great importance for educational
measurement. The appraisal of certain things is unreliable, even impossible,
just because teachers agree neither as to their dimensions nor as to the
status of pupils relative to these dimensions. Character, personality,
citizenship, appreciation of art are among the phenomena that often fail to
produce agreement among observers. Consequently, before one attempts to
measure them, he must find or devise dimensions for which there can be some
consensus of reaction.
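
One rough way to put a number on such consensus is sketched below: the
proportion of observer-pair comparisons in which two observers assign the same
rating to the same pupil. This index, the function name, and the ratings are
assumptions made for illustration only; the text itself prescribes no
particular measure of agreement.

```python
from itertools import combinations


def pairwise_agreement(ratings_by_observer):
    """Share of (observer pair, pupil) comparisons with identical ratings.

    ratings_by_observer holds one list per observer; each list gives that
    observer's rating for the same pupils in the same order.
    """
    agree = 0
    total = 0
    for first, second in combinations(ratings_by_observer, 2):
        for a, b in zip(first, second):
            total += 1
            if a == b:
                agree += 1
    if total == 0:
        raise ValueError("need at least two observers and one pupil")
    return agree / total


# Hypothetical ratings of four pupils by three observers.
counted_errors = [[3, 0, 5, 2], [3, 0, 5, 2], [3, 1, 5, 2]]
citizenship = [
    ["good", "poor", "fair", "good"],
    ["fair", "good", "fair", "poor"],
    ["poor", "poor", "good", "fair"],
]

print(pairwise_agreement(counted_errors))
print(pairwise_agreement(citizenship))
```

In this made-up data the counted errors draw far more agreement than the
global judgments of citizenship, which is the contrast the fifth condition
describes.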

In the preceding paragraphs we have developed five conditions of measur- 
ability. To recapitulate, measurement symbols can be validly assigned to 
phenomena in the degree to which their dimensions: 

1. Are possessed in common by a group of phenomena

2. Provide sensory data

3. Are clearly defined

4. Manifest variation

5. Yield equivalent reactions from unrelated and impartial observers

In applying the second, third, and fifth of these criteria to educational
phenomena, it is necessary to consider that they represent poles or extremes of
continuums, at which other extremes are conditions that prevent measurement.
In the middle of these continuums lie conditions intermediate between
accurate measurement and no measurement. For items found here, it will be
necessary to decide whether they are sufficiently observable, defined, and
productive of consensus in observers to be considered measurable dimensions.
These considerations are made graphic in the following chart.

Obviously                                        Obviously not
Measurable               Measurable              Measurable
Dimensions               Dimensions?             Dimensions


Differentiated sense data                        No or vague sense data
Well defined                                     Not defined
All unrelated and impartial                      No unrelated and impartial
observers agree                                  observers agree


In preparing dimensions of educational phenomena for measurement, it 
may be assumed that they will meet the conditions of being applicable to a
group and of manifesting variation. On the other hand, it should be assumed 
that they will only approximate the other three conditions and that many will 
fall so short of the conditions as to be unmeasurable in their given form. For 
this reason, the tests of sensory data, definition, and consensus among ob- 
servers must be rigorously applied to educational phenomena before their 





measurement is attempted. In every case dimensions should be so construed
that the maximum of extensive and appropriate sensory data is provided, that
their definitions are as clear as possible, and that maximum agreement may
be expected among observers. If in your view any dimension should still fall
too short of one of the conditions for accurate measurement, its measurement
should not be attempted; for once a measurement symbol is applied to an
individual, there is a presumption that something has been measured. And we
have attempted to show that nothing may have been measured unless the five
conditions of measurability have been approximated.

Obviously, these conditions are interrelated. Where one is met there is
likelihood that the others are as well, and efforts to correct one may correct
others. For example, clear definition is possible only with some degree of
sensory data. It is possible, then, to indicate one general method of defining
phenomena and their dimensions so the three critical conditions of sensory
data, definition, and consensus among observers may be approximated. This
is:

1. To define the phenomenon in terms of its dimensions

2. To define the dimensions in terms of specific behaviors and their ob- 
servable attributes 

This method is explained and stressed throughout the book in connection
with discussions of measuring procedures and the many focuses and uses of
measurement in education. Here we will include only a simple application of
the process to illustrate its effectiveness.

“Literary appreciation” is one of the many aspects of pupil achievement
whose evaluation continues to exasperate teachers. One reason for this
situation may be that “literary appreciation” often is not clearly defined and
has no clearly indicated dimensions. To measure it requires first that it be
defined in terms of its dimensions or properties. One such definition might
be: Literary appreciation means the degree to which a pupil:

1. Understands the structure and symbolism of great writing

2. Is aware of the social significance of poetry, prose, and drama

3. Discriminates between popular and classic works

The basic dimensions of literary achievement, according to this definition, are
the elements listed. But, as stated, none of them is clearly defined, they
offer no sensory data, and observers would argue about them. To redefine them
so as to correct these shortcomings, we must state them in terms of actions in
which a pupil manifests his status re the dimensions. For the third dimension,
“discriminates between popular and classic works,” such behavioral
redefinition might be in part: “Choice of titles for free reading; recall of
titles, characters, and situations in conversation and writing; comparisons of
current writings

There are many other possible definitions of literary appreciation. An
excellent one is published in the Thirty-Seventh Yearbook of the NSSE, Part I,
Bloomington, Illinois, Public School Publishing Co., 1938, pp. 114-115. See
also Chapter 10, page 254 here.




with those deemed classic.” This redefinition seems to be reasonably clear.
It is possible to listen to or to read what titles pupils recollect and what
comparisons they make. Moreover, observers are likely to agree about the
choices, recollections, and comparisons. Finally, pupils’ status in respect to
these operational dimensions is a function of how many classic versus
nonclassic titles were read, how many classic versus nonclassic titles,
characters, and situations are recalled, and how many and what kind of
comparisons they make between current and classic writings. Thus, in counting
how many and in stating what kind, we indicate the degree to which the pupils
manifest this given aspect of literary appreciation.
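
The counting step just described can be sketched in a few lines. The sketch
assumes that the teacher has already agreed on a list of titles deemed
classic; that list, the pupil's reading record, and the function name are all
hypothetical.

```python
def count_classic_titles(titles_read, classic_titles):
    """Return (classic, nonclassic) counts for one pupil's free reading."""
    classic = sum(1 for title in titles_read if title in classic_titles)
    nonclassic = len(titles_read) - classic
    return classic, nonclassic


# Hypothetical list of titles deemed classic, and one pupil's record.
CLASSIC_TITLES = {"Treasure Island", "Silas Marner", "Julius Caesar"}
pupil_record = ["Treasure Island", "Comic Weekly", "Silas Marner", "Sports Stories"]

classic, nonclassic = count_classic_titles(pupil_record, CLASSIC_TITLES)
print(classic, nonclassic)  # 2 2
```

The same tally could be kept for titles, characters, and situations recalled
in conversation and writing; only the record being counted changes.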

Inferred Dimensions and Constructs 

We have discussed the fact that measurement may be directed only to
observable dimensions. Yet, as we have noted, many of the things that educators
need most to appraise are abstractions about behavior or are covert,
unobservable states of mind and feeling. Often to these so-called
“intangibles” are ascribed properties that are themselves unobservable.
Scientifically speaking, any such properties constitute inferred dimensions.
Since inferred dimensions provide no sensory data, they cannot be measured
directly. They can, however, be measured indirectly by finding or devising
observable dimensions that are related to the inferred ones.

Intelligence is one of the many educational phenomena that have inferred
dimensions. It is customary to speak of inductive reasoning as an aspect (or
dimension) of intelligence and most intelligence tests attempt to measure a
child’s ability to reason inductively. Yet “inductive reasoning” as such is not
observable. We haven’t as yet been able to get inside the mind and directly
measure what goes on there when a person reasons. Hence, the dimension is
an inferred one. We measure it by observing a child at tasks that we say
require inductive reasoning: tasks such as making up rules to explain what has
happened, solving codes, working out ratios, etc. Sometimes an observed
dimension is measured so often to determine the same inferred one that we
become interested in it for itself. We treat it as if it really belonged to
the phenomenon. Those who give individual mental tests talk about “digit
span.” This is how many numbers a child or an adult can recall, having heard
them just once. By continued usage, “digit span” has become nearly as much a
dimension of intelligence as the “memory” to which it relates.

Inferred dimensions may be measured as accurately as any if they are
properly inferred and if the related dimensions to be directly measured are
properly chosen. Because of this we are able to evaluate pupils’ knowledge,
their personality, their intelligence, and their attitudes, all of which are
themselves unobservable. In Chapter 6 and in Chapters 11 and 12 we shall see
that the construction of achievement tests is largely a matter of keying the
observable responses of pupils to test questions on the one hand to the
inferred dimensions of their knowledge of school subjects on the other.




Inferred dimensions often belong to a type of phenomena called con- 
structs. Intelligence and personality, just mentioned, both are constructs. So 
are the atom, radio waves, the subconscious mind, and many other things. A 
construct is an explanation, not a thing. It is not a rock; it is a geologist's 
theory of the molecular structure of that rock. 

Very simply, constructs are symbolic maps where words and their
interrelationship, numbers and their interrelationship, represent the
structure or process of unobservable biological and physical states. They are
based on and serve to explain the observable data for which we assume some
unseen or underlying causation. Like maps, constructs summarize; they boil
things down to a small scale. Like maps, constructs leave out details; they
abstract, omit, and select. Neither maps nor constructs need to look like the
terrain or structure they explain; they use conventional symbols.

Finally, both maps and constructs facilitate predictions about observable
things. Open a highway map of Missouri and see how far it is from Kansas
City to St. Louis: 257 miles. At 50 miles per hour, with allowance for stops
and traffic, that should take 6½ hours. So tomorrow you allow 6½ hours to
go to St. Louis from Kansas City and you get there in 6½ hours. With some
constructs, electronics for example, one can make fantastically accurate
predictions. Vacuum tubes are built to make a certain pattern on the screen of
a specific television set and they do just that, all on the basis of an
engineer’s specifications. Yet the engineer need never see the tube, nor the
set, but only some electrical and mathematical symbols on paper. With some
constructs of educational significance, there is much less accuracy in
predictions. Knowing that ten children aged six have IQ’s of 90 and ten other
six-year-olds have IQ’s of 110, we would predict that the latter group, as a
group, will learn to read faster and read more in the primary grades than the
former group. We order books and plan instruction accordingly, and our
prediction is fulfilled. However, for just one child of low IQ and one child
of higher, our prediction would be less confident and our mistakes more
frequent. Unfortunately, our current constructs of intelligence are sometimes
like the maps of the “New World” in the time of Amerigo Vespucci. According to
legend, with his maps as navigational guides, it was not at all difficult to
miss a continent, let alone an island.

The inferred dimensions of a construct are the properties which the
construct needs to have to explain the observable data to which it relates.
The test of their validity is the accuracy of the predictions they support.
For example, “memory” is a valid dimension of intelligence to the extent that
accurate predictions may be made about the academic success of pupils as the
result of measurements of their ability to remember.

Even though an inferred dimension is a valid one, it still needs to be 
measured accurately and this first entails selection of appropriate observable 
dimensions. With respect to “intelligence” and two of its dimensions, memory 
and reasoning, history has shown us what are and what are not appropriate 




observable dimensions. In the eighteenth century, and even in the nineteenth
century, it was not uncommon to assess memory and reasoning by measuring
the contours of the skull, by observing how high was the forehead or how
wide-set the eyes, sometimes even by asking the date of birth and the
concurrent status of the zodiac. Now, to test for memory and reasoning, we ask
children to recall numbers, words, and pictures and to do problems in
arithmetic, work puzzles, run mazes, and the like.

Principles in the Selection of Dimensions 

For the most part, the phenomenon to be evaluated will indicate the
dimensions to be measured. If it is spelling ability, what and how many words
can the pupil spell? If it is a question of facility at wiring in a shop
situation, the dimensions are how long the student requires to complete the
job, the errors in his connections, the quality of his soldering, etc.
Sometimes, however, the phenomenon itself does not immediately disclose the
dimensions appropriate for measurement. In such cases it is well to have in
mind certain principles that are relevant to their selection.

In the first place, dimensions should be selected that relate directly to the
purpose of the measurement. For example, the sponsor of the school paper,
who wishes to pick the student staff, might wish to measure the variety of
adjectives and adverbs that prospective reporters can use. On the other hand,
the secretarial teacher, in choosing students to do stenographic work in the
school, might not be interested in such a factor as this but perhaps only in the
frequency of grammatical and spelling errors. Yet both teachers would be
measuring the same basic thing, composition.

Secondly, dimensions should be selected in terms of the precision required.
Any measurement purpose entails a minimum degree of precision in results
if the purpose is to be accomplished. Precise appraisal is partially a function
of the dimensions measured and consequently we need to select the dimensions
that give the required precision. To illustrate this point, if the third-grade
teacher needs only to distinguish three levels of reading ability relative to the
Readers she uses, she then may be concerned only with rate and comprehension
for this material. The reading diagnostician, however, needs to know the
exact status of each pupil for all types of reading; so he appraises the
dimensions of rate and comprehension for each of several types of material, and
additional dimensions (recognition techniques, sight vocabulary, etc.) as well.

It sometimes occurs that the measurement task may not be accomplished
satisfactorily through the measurement of any dimensions currently known
for the phenomenon. It is proper and essential, in this event, to find or invent
new dimensions through imagination, experimentation, and theorizing. New
dimensions are commonplace for the physical scientist. In fact, science has
advanced as scientists have had the temerity to name and the tenacity to
measure new properties. Think of Curie and “radiation,” Newton and “gravity,”
and some unknown ancient sea captain and the “polarity” of the earth.




In education, the finding and inventing of new dimensions is less remarkable,
but a scientific approach to education is a very young idea. However, all these
very useful dimensions of pupil behavior are new: intelligence, needs and
drives, social status, group cohesiveness, ego defenses, aggressiveness,
compensation mechanisms.

Finally, dimensions should be selected in light of the standards to be used
in evaluation. We observed in the overview to this section that effective
evaluation requires the comparison of a phenomenon (achievement in English,
knowledge of history, swimming ability, anything) with a standard. The
standard normally consists of one or more levels or classifications of status to
which are assigned one or more symbols of value or worth. For example, an
evaluative standard in typing might be: “Less than 20 words per minute,
unsatisfactory; 20 to 40 words per minute, fair; 40 to 60 words per minute,
good; 60 and more, professional.” In Chapter 9, the development and use of
evaluative standards are discussed in detail, and in Section II attention is
given to the standards employed in various school subjects.
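
The typing standard quoted above can be written out as a short sketch. As
stated, the standard leaves the boundary speeds of 40 and 60 words per minute
in two levels at once; the sketch assumes that a boundary speed belongs to the
higher level. That rule and the function name are assumptions for
illustration.

```python
def evaluate_typing(words_per_minute):
    """Map a measured typing speed onto the levels of the example standard.

    Assumption: a speed exactly on a boundary (20, 40, 60) is assigned to the
    higher of the two adjoining levels.
    """
    if words_per_minute < 0:
        raise ValueError("speed cannot be negative")
    if words_per_minute < 20:
        return "unsatisfactory"
    if words_per_minute < 40:
        return "fair"
    if words_per_minute < 60:
        return "good"
    return "professional"


for speed in (15, 20, 40, 59, 60):
    print(speed, evaluate_typing(speed))
```

The function accepts only speed because speed is the only dimension cited in
the standard; nothing else that might be measured about the typist could
change the evaluation.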

Obviously, a standard must be expressed in terms of one or more
dimensions of the phenomenon to which it refers. It follows then that
measurement that is to lead to evaluation must relate to precisely the same
dimensions as are cited in the evaluative standard. Any less will make the
evaluation indeterminate or impossible; any more will be superfluous. In the
typing example, you could use the standard only if you gave a speed test or
otherwise measured the dimension of speed, and there would be no point to
measuring accuracy if it were omitted from the standard. Obviously, if the
standard referred both to speed and accuracy, both of them would need to be
measured.

Some Dimensions Common to Most Educational Phenomena 

In later chapters we will cite many of the varied dimensions of specific
educational phenomena. Here it may be useful to mention several general
dimensions likely to be common to most of them. Knowledge of these should
help in understanding the particular dimensions given for some specific
subjects and in determining the measurable dimensions of others.

Since all educational phenomena have parts or aspects, the identity,
number, and organization of components are common and important dimensions.
In measuring a student's ability in composition, for example, the types of
sentences, clauses, and phrases he uses are to be identified, the number of
varied constructions is to be determined, and the ways in which paragraphs are
organized are to be noted. “Tests” in history and geography invariably seek to
probe what facts (identity of components) and how many facts (number of
components) a pupil knows and how he has organized these facts (pattern
among components).

A fourth common dimension of educational phenomena is time. How 
long this behavior has lasted is a frequent concern of the teacher. Speech 
teachers give attention to this dimension as they listen critically to a student’s 




talk and tell him to sustain the sound of given words longer or to speak more 
crisply. 

Rate or speed, a fifth common dimension, is well known to reading and
business instructors and is formed of two other dimensions, time per given
component. Forty words per minute in typing, fast tempo in music, 300 words
per minute in reading, are examples of measurements of rate. A sixth common
dimension, frequency, is an enumerative counterpart of rate. Examples of the
dimension are: number of errors per 100 words, number of pupils who are
twelve years old, and number of times Bill or Mary was disorderly this week.
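
The arithmetic behind these two dimensions is simple enough to sketch. The
function names and the sample figures below are hypothetical; they show only
how a rate is formed from a count of components and a time, and how a
frequency is expressed against a fixed base of 100 words.

```python
def rate_per_minute(components, minutes):
    """Rate: number of components (e.g., words) per minute of performance."""
    if minutes <= 0:
        raise ValueError("time must be positive")
    return components / minutes


def errors_per_100_words(errors, words):
    """Frequency: number of errors for every 100 words produced."""
    if words <= 0:
        raise ValueError("word count must be positive")
    return 100 * errors / words


# Hypothetical typing sample: 200 words in 5 minutes with 6 errors.
print(rate_per_minute(200, 5))        # 40.0 words per minute
print(errors_per_100_words(6, 200))   # 3.0 errors per 100 words
```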

Finally, error itself is a dimension common to pupils’ achievement in all
school subjects. Certain ways of spelling words, certain answers to
multiplication problems, certain name-date associations, and a myriad of other
things are called errors because they are all departures from how it is or how
it should be. Since rectitude of performance is a primary goal of education,
and since pupils have never ceased to depart therefrom, error must be
designated as a general dimension of achievement.

The choice of dimensions of pupil achievement in those school subjects
that we call areas of knowledge or understanding poses a particular problem.
While there is a general agreement that science, history, geography, etc.,
should be taught, there is less agreement on what should be taught in their
name. And there is practically no agreement on what should be measured
when a student’s proficiency in any knowledge subject is to be judged. Much of
this difficulty may stem from the fact that the “knowledge” must be an
attribute of the pupil, if we measure him relative to it. On the other hand,
that which we think of when we try to measure “knowledge” often is apt to be
the content of the books and other documents comprising the “knowledge” we
try to teach him. The books contain words and other symbols arranged in
logical and static ways, while the pupil contains action and feeling tendencies
arranged psychologically and dynamically, if he may be said to “contain”
anything at all.

However the problem arises, its solution requires that any subject of
knowledge or understanding be clearly analyzed from the viewpoint of
measurable dimensions before any measurement is undertaken. The general
dimensions we have just discussed offer a starting point in this analysis. The possible
components of the pupils’ knowledge, their identity, number, and organization,
may be construed as the expressed facts and concepts, laws, etc., which the
subject encompasses and the logical or chronological relationship among them.
The dimension of error could refer to errors in the pupils’ grasp of such facts,
concepts, laws, etc., as against their representation in books and other
reputable documents.

In addition, though, if understanding of something is to be measured
adequately, dimensions should be selected in terms of the psychology of the pupil,
not just in terms of the logic of the subject. The pupil learns the elements of
the subject in various ways: he relates them to other experiences; he applies
them to new situations, modifying them as he does so. His knowledge at any
moment is a function of his mind in action. It is not simply a reflection of some
portions of books and lectures with an occasional distortion.

Consequently, pupil knowledge or understanding of any subject necessarily 
has such dimensions as interpretation, application, and synthesis of facts, as 
well as that of facts recalled. It has dimensions that have to do with the 
hierarchical arrangement of ideas as well as their number. 

Later in the text a number of school subjects are examined in the light
of this viewpoint toward the dimensions of knowledge and understanding. A
recent remarkable publication, Taxonomy of Educational Objectives (2),
presents a general classification of educational goals tantamount to a statement of
the possible basic dimensions of student achievement. The various classes of
goals (or dimensions) stated for knowledge and understanding are given in
terms of pupil thought and action and not just the logic of a “subject.”
Furthermore, the authors describe appropriate methods of testing to determine the
status of students in respect to each of the goals. While it may have some flaws,
this taxonomy could be an extremely useful basis for devising the measurable
dimensions of achievement in particular subjects. The categories the taxonomy
presents for the “cognitive domain” are as follows:


Knowledge

1.0   Knowledge
1.10    Knowledge of specifics
1.12      Knowledge of specific facts
1.20    Knowledge of ways and means of dealing with specifics
1.21      Knowledge of conventions
1.22      Knowledge of trends and sequences
1.23      Knowledge of classifications and categories
1.24      Knowledge of criteria
1.25      Knowledge of methodology
1.30    Knowledge of the universals and abstractions in a field
1.31      Knowledge of principles and generalizations
1.32      Knowledge of theories and structures

Intellectual Abilities and Skills

2.0   Comprehension
2.10    Translation
2.20    Interpretation
2.30    Extrapolation
3.0   Application
4.0   Analysis
4.10    Analysis of elements
4.20    Analysis of relationships
4.30    Analysis of organizational principles
5.0   Synthesis
5.10    Production of a unique communication
5.20    Production of a plan, or proposed set of operations



CHAPTER 3 


PROCEDURES OF MEASUREMENT IN GENERAL


We now have discussed two major aspects of the measurement of
educational phenomena. In the first chapter we examined the forms of symbolic
expression that measurements may produce: classification-description, rank
or order, and scaling. We determined in the second chapter that symbols
(numbers or, if not, words and letters with precise definition) appropriate to
any of these forms may be assigned to any phenomenon you wish to
measure, but only as long as the phenomenon is measurable. To help gauge the
measurability of phenomena, five conditions of measurability were established.
If it is to be measured, a thing must have one or more dimensions (property,
quality, aspect, etc.) in common with other things. These dimensions must
provide us with some sensory data, be discrete and well defined, manifest
variation, and produce highly similar reactions among unrelated and
impartial observers. It was noted that the measurement of physical phenomena
rarely involves the application of these conditions, so well do the familiar
physical dimensions meet the conditions. On the other hand, behavioral
phenomena, the things of primary interest to teachers, must invariably be
examined for measurability before their measurement is attempted.

At this time, we shall give attention to the third and most time-consuming
aspect of measurement: the procedures by which measurement is
accomplished. If we were concerned with measuring speed, height, weight, and other
physical dimensions, we would start to talk about speedometers, yardsticks,
balances, photometers, calipers, spectrometers, etc., and the techniques for the
use of any of them. Since, though, we are to deal exclusively with behavioral
phenomena, our attention will be on the devices we use to appraise behavior:
tests, rating scales, observation schedules, and the like. In the four succeeding
chapters we will present the nature and significance of the following
procedures together with principles for their efficient use: observation, product
analysis, free-response procedures, and guided response techniques. Before
we turn to them, however, it is necessary that we understand something about
behavioral measuring procedures in general.

Function of Measuring Procedures 

When, as teachers, we give arithmetic or science tests to our pupils we have
to score the papers. We count the number of answers wrong or right and place
a corresponding number or letter on the papers. When, in an English class, we
ask the pupils to write a short composition so that we may see how good is
their usage, we similarly mark their submissions and place an appropriate
number or letter on the compositions. Now, this number or letter or word is
the measurement. The purpose of the tests and of the writing of the
compositions has been only to obtain these numbers or letters. Thus, the burden of a
measuring procedure is the assignment of proper measurement symbols. These
characterize the status of the pupils in regard to arithmetic, science, or
composition, which was our purpose in measurement.

Obviously then, the things we may be accustomed to think of as
constituting measurement (true-false tests, blue books, rating forms, and all the other
paraphernalia of educational measurement) are only the means of
measurement. If the proper measurement symbol may be assigned without using the
procedure, the procedure is of no use. If a student’s achievement in history can
be symbolized as validly, reliably, and efficiently by just listening to him recite
as by testing him, there is no point to testing him. On the other hand, we shall
see presently that the assignment of the proper measurement symbols usually
necessitates some sort of set procedure. Hence, our tests, our handwriting
scales, our interest inventories, and our aptitude batteries are indispensable
to measurement if they are not synonymous with it.

Basic Properties of Procedures 

The fact that the phenomena of primary interest in education are behaviors
or psychological states poses problems for measurement procedures. It is no
trick at all to assign the proper scale number to a boy’s height. Stand him
alongside a rod or wall lined off into feet and inches and read the point with
which the top of his head intersects. Five feet, one inch, and that’s all there is
to it! But to measure his intelligence, his knowledge of arithmetic, or his
attitudes toward school may require more complicated devices and a great deal
more ingenuity. Behavioral or psychological entities are usually processes;
many of their components are unseeable; they do not stand still or have any
mass; they change very rapidly; and they amount to more than can be
measured at any one time.

To accommodate to these conditions and yet to produce reasonably
accurate measurements, the procedures of behavioral measurement employ standard
stimulations, they attempt to elicit standard differential responses, they use
standard analysis systems, and they engage in sampling. Sampling and
standard analysis are inherent in all types of procedure, but the other devices may
or may not be the property of a given procedure.

STANDARD STIMULATIONS

The obvious characteristic of what you know as a test is that it provides
the same stimulation to all pupils. Both the true-false question and the essay
question are intended to provide the pupils who read them with the same
sensation so that their different responses may be compared. Obviously, unless
the pupils respond to the same thing, the significance of any comparison made
among them is unknown. And, as we know, comparison among pupils is
necessary in many instances of behavioral measurement.

What standard stimulations are presented is, of course, determined by
what responses are desired and these in turn by what dimensions are to be
measured. For example, the statement, “Columbus discovered an island in
the Caribbean Sea in 1492,” probably is used to elicit a “Yes” or a “No” from
pupils and the “Yes” or “No” is desired because it is thought to be keyed to
the facts known by a pupil about American History. The purpose of the
question, “What is the relationship between on-shore winds, coastal mountains,
and rainfall?” is to obtain a sentence, a paragraph, or more from students
that should reflect their different understandings of the interrelationship
among climatic factors.

The types of standard stimulations employed in educational measurement
are many. They will be presented in connection with free response techniques
and, in particular, with guided response techniques. Whatever procedures are
employed, behavioral measurement requires the occurrence of behavior or
evidence of a behavior, and standard stimulations are devices to obtain
comparable behaviors or evidences thereof from a number of pupils.

STANDARD DIFFERENTIAL RESPONSES

If the most obvious characteristic of a test is its inclusion of standard
stimulations, a second, but equally important characteristic, is its attempt to elicit
standard yet differential responses. The necessity of differential responses
occurring is axiomatic. We established earlier (Chapter 2, page 22) that a
measurable dimension is one in which variation is manifest. Consequently, any
procedure for measuring a dimension must be sensitive to or productive of
differences. The least differential responses are obtained by the very familiar
true-false items; the most, by such free response items as essay questions
and tell-me-a-story pictures. The true-false variation is only twofold, the least
possible, while, within limits, no two responses to an essay question are ever
identical and, hence, the variation of responses to them is as great as the
number of students tested.

By “standard” is meant simply that the same words or actions given by 
different pupils are to mean the same thing and that the options of response 
possible in a given situation are known and have predetermined significance. 
Multiple-choice questions exemplify test items that elicit highly standard 
responses. Consider the following item: 

A. The achievement of the Greeks was relatively greatest in:

1. war
2. social reform
3. government
4. laborsaving machinery
5. navigation charts

A pupil may answer this only by placing a number, 1 to 5, in the space
provided, and thus the nature and number of possible responses are strictly
controlled. One of the numbers means that the pupil has correct knowledge
concerning this aspect of Greek civilization. All others mean that he has
incorrect knowledge or no knowledge (guessing). Hence, the options have
predetermined significance and they are assumed to have the same meaning no
matter which pupil uses them.

Essay questions, on the other hand, produce responses much less
standardized. The words used by different pupils are assumed to mean the same
thing but the significance of different types of response is only roughly
predetermined and the number and nature of possible responses is unknown. As
we shall see presently, this relative lack of standardization tends to make
essay questions and other free response items more difficult to score and
inherently less reliable than guided response items.

Obviously, the more standardized are the answers to test questions, the
more comparable are the answers given and the more impersonal their scoring.
True-false, multiple choice, matching questions, and other guided response
items yield test scores that may be ranked and can be marked mechanically or
electrically, the ultimate in objectivity. The answers to essay questions given
by several pupils are not as amenable to ranking and they are not susceptible
to machine scoring. Even with a well-devised system for marking, the examiner
must make many judgments as to the meaning and significance of phrases.

Standardization of responses is gained, though, at the expense of
differentiation. For attributes wherein pupils are known to manifest great variation
(intelligence, personality, musical ability, etc.) objective tests with their rigidly
standardized responses have limited usefulness.

STANDARD ANALYSIS SYSTEMS

The sine qua non of procedures of educational measurement or any
measurement is a standard analysis system. Some are very simple: the “key” to a
science test or an observation sheet with captions and space for writing. Others
are very complicated: a factor-count schedule for a projective personality test
or the scoring standards for an individual intelligence test. But, simple or
complex, the function of a standard analysis system is to insure that the same
measurement symbols are assigned to phenomena of similar status; that
different symbols are assigned to phenomena of different status; and that all
phenomena purportedly measured in regard to a given dimension are each
measured in the same way for just that dimension.

To this end varieties of techniques are employed for detecting, recording,
classifying, counting, and comparing the proper dimensions of the behavioral
phenomenon being measured. An observation check list helps the teacher look
for and, thus, detect in a systematic way the pupil actions he considers relevant
to citizenship. The record forms and rating blanks used in individual
intelligence and personality testing are simply the means by which the examiner
records his observations of all those he tests in a standard fashion. A test key
is a good example of a classifying device. It permits the automatic assignment
of items to categories of right and wrong, or whatever other categories are
desired. An electrical scoring machine not only classifies responses
automatically, but also counts the number of responses in each category. Perhaps the
most familiar illustration of a comparison device is a Handwriting Scale. In
this, as you doubtless recall, graded samples of penmanship are provided for
comparison with a specimen written by a pupil. Different samples are tried
until the one is found that is most like the pupil’s specimen. The specimen is
then assigned the rating of that sample.
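
The classifying and counting functions of a test key can be sketched in a few lines. This is only an illustration under assumed conventions: the function name, the use of None for an omitted item, and the five-item key are hypothetical and do not describe any particular scoring machine.

```python
from collections import Counter


def score_with_key(responses, key):
    """Classify each response as right, wrong, or omitted, then count each category."""
    if len(responses) != len(key):
        raise ValueError("responses and key must cover the same items")
    counts = Counter()
    for given, correct in zip(responses, key):
        if given is None:            # assumed convention: None marks an omitted item
            counts["omitted"] += 1
        elif given == correct:
            counts["right"] += 1
        else:
            counts["wrong"] += 1
    return counts


key = [3, 1, 4, 2, 5]           # hypothetical key for five multiple-choice items
pupil = [3, 1, 2, None, 5]      # one pupil's answers
result = score_with_key(pupil, key)
print(result["right"], result["wrong"], result["omitted"])  # 3 1 1
```

Because the key is applied identically to every paper, pupils of similar status receive the same symbol, which is the purpose of a standard analysis system.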

Additional devices frequently are used to convert verbal data into
numerical data, to convert raw or first numbers into those having class, rank, or scale
significance, and to derive measurements of a group from the measurements of
the individuals within the group. The latter two conversions are the burden
of the statistical procedures described in Chapter 7. The tables and profile
sheets of standardized achievement tests (see page 125) are the means by
which a teacher converts the number of questions a pupil answered correctly
(raw score) into a grade index, a percentile equivalent, or sometimes a scale
number.
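
As a rough illustration of one such conversion, the sketch below turns a raw score into a percentile equivalent relative to a norm group. It assumes one common definition of percentile rank (the percent of scores below, with ties counted as half); the tables of a published test may be built differently, and the norm scores here are invented.

```python
def percentile_rank(raw_score, norm_scores):
    """Percent of the norm group scoring below raw_score, counting ties as half."""
    if not norm_scores:
        raise ValueError("norm group must not be empty")
    below = sum(1 for s in norm_scores if s < raw_score)
    equal = sum(1 for s in norm_scores if s == raw_score)
    return 100.0 * (below + 0.5 * equal) / len(norm_scores)


# Hypothetical norm group of ten raw scores.
norm_group = [12, 15, 15, 18, 20, 22, 22, 22, 25, 29]
print(percentile_rank(22, norm_group))  # 65.0
```

The raw score of 22 carries little meaning alone; the converted number gives it rank significance within the group.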

SAMPLING

The last general feature of measuring procedures we need to consider is
that of sampling. The idea of sampling is simple. You measure a small portion
of something and then make statements about all of it. But the implications of
sampling for measurement are many, varied, and not always obvious.
Therefore, to explain the concept properly, we shall use a very tangible illustration.

Suppose that a housewife, who has just bought an apple pie from a small
bakery, eats a piece and then says to her husband and children, “This is a
good pie, have some.” The woman has made a declaration about all the pie,
but she has tasted only a piece of it. Since a pie made by a bakery tends to be
uniform throughout, it is highly probable that her husband and children will
find it good also. But it is not certain that they will do so. The side she tasted
may contain more or less of the fruit than the other side. Her piece may have
received more or less heat, and so on.

Now suppose the housewife does find all the apple pie to be good, and
because of this she considers the idea of buying all the pies the bakery has
produced that day (which happens to be Tuesday). Will that be wise? Perhaps
the bakery’s berry pies contain overripe fruit; perhaps in their meringue pies
powdered eggs have been used.

Finally, suppose that the housewife does buy the bakery’s entire output
of pies for Tuesday, stores them in her freezer, and eventually she and her
family eat them all and find them uniformly delicious. Should she then decide
to buy all this bakery’s pies each Tuesday? She should only if she can be sure
that the oven’s temperature next Tuesday will be exactly the same as
this Tuesday, that the dough will be mixed just as long, that the fruit,
custard, and chocolate will be just the same in quality, and so on. Obviously,
she cannot be sure.

In this illustration we may observe the several facets of sampling and 
from our observation we may derive some basic principles that affect all meas- 
urement procedures. Figure 1 shows them in graphic form. 


[Figure 1. Principles of sampling. Small samples and non-representative
samples produce inaccuracies in predictions about the whole of anything.
Samples need to be large enough and representative enough, both categorically
and temporally.]


1. Appraisals that involve sampling are estimates or predictions only.
Until the housewife has eaten all the pie, her statements about it are subject to
error.

2. Estimations based on sampling are least accurate when the sample is a
small proportion of the whole (e.g., one sixth of the pie) and when the sample
is not representative (e.g., only apple pie eaten, but berry, custard, etc., to be
purchased). Conversely, estimations based on proportionately large samples
(five pieces of the pie) and on representative samples (all varieties tasted)
are most accurate.

3. Sampling may be categorical (a fraction of a pie, one or several pies
out of many). It also may be temporal (Tuesday’s pies as a sample of pies
baked during the year). Both kinds of sampling usually are involved in
educational measurement.

Educational measurement entails sampling for several reasons. First, the
phenomenon to be measured may consist of an unlimited number of factors.
A pupil’s intelligence or the achievement of an eleventh-grade class in physics
are examples of such phenomena. Second, to measure all the cases in a given
population¹ may be too costly or time-consuming. For example, to gauge the
home conditions of each pupil in a large city school district would require a
large staff of sociometrists and thousands of dollars. Finally, the act of
measurement may consume the items that are being measured and, hence, only a
sample may be used. The chemist with his flame and precipitation tests for
unknown compounds exemplifies this reason for sampling.

The degree of indeterminacy inherent in measurements because of
sampling may be stated mathematically. The relatively simple mathematical
procedures available for determining such “sampling errors” are explained in
Chapter 8, pages 169-172. The specific applications of sampling to the several
types of measuring procedure are presented later in context with the
procedures.
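
One familiar way of stating sampling error numerically is the standard error of a sample mean. The sketch below is offered only as an illustration and is not necessarily the procedure presented in Chapter 8. It assumes a simple random sample, applies the usual finite population correction when the population size is supplied, and uses invented scores.

```python
import math
import statistics


def standard_error_of_mean(sample, population_size=None):
    """Estimated standard error of the sample mean.

    When population_size is given, the finite population correction is applied,
    so the error shrinks to zero as the sample approaches the whole population.
    """
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    se = statistics.stdev(sample) / math.sqrt(n)
    if population_size is not None:
        if population_size < n:
            raise ValueError("population cannot be smaller than the sample")
        se *= math.sqrt((population_size - n) / (population_size - 1))
    return se


# Hypothetical population of eight scores; the first four serve as a small sample.
scores = [52, 61, 58, 70, 66, 49, 63, 57]
half = standard_error_of_mean(scores[:4], population_size=8)
whole = standard_error_of_mean(scores, population_size=8)
print(half > whole, whole)  # True 0.0
```

The result parallels the pie illustration: a proportionately larger sample yields a smaller error, and only when the whole has been measured does the error vanish.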

Criteria for Procedures 

Just as we found that a dimension to be measured must first be gauged
for its measurability, so we should recognize now that a procedure of
measurement should not be used unless it is judged adequate for the purpose.
Obviously, everything called a test does not test with equal accuracy. In the
experience of each of us have been instances when we felt that a test was
unfair, too short, neglected certain aspects of the course, etc. Equally distasteful
in our recollections are the occasions when we ourselves judged people
wrongly or sized up situations improperly on the basis of faulty observations.

As a basis for discriminating between relatively good and relatively bad 
procedures, we may use three criteria: validity, reliability, and efficiency. 
These, furthermore, are the criteria that should govern the construction of 
any test or the devising of any other procedure of measurement. 

VALIDITY 

According to the customary viewpoint, the capability of a test or other
measuring procedure to measure what it purports to measure is called validity.²

1 Population is used here to mean any aggregate of items, not just a group of people.

2 Another approach to validity has been expressed by Cronbach (4:48): “A test is
valid to the degree that we know what it measures or predicts.” From this point of
view, the measurement focus of the test is indeterminate until we inspect its items or
until we see with what other measures its results correlate. The title or announced
objective of the test is of less importance. From this point of view, however, the practical
consequences for the ordinaries of school measurement seem to be about the same as
they are for the first and more usual definition, i.e., concern over factorial purity, clear
definition of dimensions, elimination of irrational response determiners, etc.




The extent to which it does this is its degree of validity and when such degree
is expressed by a number it is called a coefficient of validity.³ For example,
if a test of “mathematical achievement” differentiates among pupils on the
basis of their differing skill and knowledge in mathematics and that only, the
test is a valid test. If another test of mathematics measures mathematical skill
plus something else, intelligence, let us say, the latter test has a smaller degree
of validity than the first.

Validity in a measuring procedure is a function of several things.

1. Before any consideration is given to the procedure itself, validity
involves a clear definition of the phenomenon and its dimensions for which
measurement is to be sought. This description should so state dimensions that
their measurability may be examined (see Chapter 2) and their special
attributes identified. It must be entirely clear whether they are directly
observable or matters of inference only, whether items of recollection or of present
feeling, whether aspects of a behavior or a construct about behavior, etc.
Furthermore, the definition should be in such terms that appropriate procedures
and forms for measurement can be devised directly from the definition. This
means that the definition must be in terms of behavior and not abstractions.

2. Because behavioral measurement too frequently occurs under artificial
circumstances, the concept of the simulated task or prototype behavior has
developed. Obviously, if the phenomenon to be measured is writing ability,
then the pupil should be measured while he is writing. If this is not possible,
he may be measured doing things that resemble or simulate his ordinary
writing behaviors or that embody the essential actions of any writing behavior.
The more the behavior during measurement approximates the actual behavior
for which a measure is desired, the more valid is that procedure. If even a
simulated behavior is impossible, then responses must be contrived for
measurement that are correlated to the highest degree possible with the behavioral
phenomenon in question. A good mathematics test may have high validity on
the basis of employing simulated tasks. Even the best personality test has to
base its claim to validity on the second approach, which we will call indirect
measurement. (For examples of simulated task tests see Chapter 13, pages
347-350, in particular.)

3. A third variable in validity is the extent to which the measuring
procedure appraises in proper proportion and perspective all the significant
dimensions of the phenomenon being measured. For example, reading involves rate
and comprehension for varied materials and for varied purposes. A procedure
for measuring reading ability that involved comprehension, one type of
material, and reading for one purpose only, memorization, say, would be less valid
than a procedure that measured rate also, used three different materials
(textbook, fiction, and newspaper), and entailed reading for purposes of searching
and skimming as well as for memorizing. If one or more of the dimensions
consist of components or parts and these are in any way extensive, the
measuring procedure must engage a sufficiently large and sufficiently representative
sample of these components. This is the major variable in the validity of
vocabulary tests where extensiveness of vocabulary is the chief dimension.

3 Statistical expressions of validity are examined in Chapter 8, page 186.

4. A fourth consideration in validity involves the extent to which the
results obtained by the procedure are produced by the dimensions purportedly
being measured and not by other unpredicted variables. Suppose that a general
science test had many unnecessarily long sentences and employed many
unusual words. Very good readers would make higher scores than average
readers, even if they had an equivalent knowledge of general science. Thus,
this test would be invalid to the degree that differences in reading ability
affected the results. Reading is a very usual unpredicted variable that tends to
invalidate measuring procedures to some degree. Other common ones are test
anxiety, differences in intelligence and motivation among pupils, and bias
on the part of the examiner.

Each type of measuring procedure has, of course, certain unique problems
relative to validity and for each there are specific means of increasing validity.
These will be presented when each procedure is discussed later in this chapter.
Ways of estimating and expressing the validity of procedures are treated in
Chapter 8, page 186.

RELIABILITY

The second criterion of adequacy in a measuring procedure is reliability.
Reliability means that the procedure measures consistently and uniformly over
the duration of the procedure and in a reapplication of the procedure. A steel
ruler is a good example of a thoroughly reliable measuring device. No matter
when or where you use it or how you hold it, its inches are the same length and
the same distance is measured to be the same time after time.⁴

A “fisherman’s” ruler and a rubber ruler are examples of unreliable
measuring devices. The “fisherman’s” ruler looks something like this:



It is fine to make a short fish long enough to talk about, but not very good for
measuring anything else. The trouble with it is that its inches are not of
uniform length. If you measure at the short inch end, an inch could be as long as
eight inches. Thus, the same distance will appear to vary as to length
according to what part of the ruler you use. A rubber ruler, on the other hand, may
have inches of equal length but all of it or any part of it may be stretched.
Hence, depending upon how tightly or loosely the ruler is held during
measurement, distances apparently become shorter and longer.

4 Slight deviations in readings due to varied positions of the eye, differences in
lighting, and the expansion-contraction of the metal do occur. They are so slight, however,
as to be negligible for most purposes.

Analogous to the “fisherman’s” ruler is a test uneven in quality
(typography, directions, vocabulary). The student who happens to know the answers
to the good portions of the test is favored, whereas the student whose
knowledge relates mostly to the poor parts is hampered. The rubber ruler test is one
in which many questions are ambiguous or whose directions are subject to
varied interpretation. In January a pupil may define the words one way and
interpret directions accordingly. In June he has forgotten his January
definitions and interpretations and makes new ones. It is hardly a surprise that his
score is different in June.

In educational measuring devices the intent, of course, is to approach the
degree of reliability possessed by the steel ruler. The requirements for
maximum reliability in each basic procedure are discussed in subsequent chapters;
but by and large, length and clarity are the critical variables for reliability. As
procedures are more extensive, they tend to sample more adequately and the
relative effect of a few bad items is diminished. Clarity of expression prevents
the ambiguity and misinterpretation which, along with deficient sampling,
constitute the more important sources of unreliability.

Several statistical procedures are available for determining the degree of
reliability possessed by a given procedure. Degree of reliability is expressed
mathematically as a correlation coefficient and is called a coefficient of
reliability. The size of coefficients for published standardized tests tends to range
from .70 to .95, with the most reliable achievement batteries having coefficients
of .90 to .95. The significance of coefficients of different sizes together with
techniques for deriving them are explained in Chapter 8, beginning on page 172.
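
To show what such a coefficient looks like in practice, the sketch below computes a product-moment (Pearson) correlation between two administrations of the same test. It assumes a test-retest design, which is only one of several ways of estimating reliability and may not be the technique emphasized in Chapter 8. The January and June scores are invented.

```python
import math


def pearson_r(x, y):
    """Product-moment correlation between two equally long lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least two scores")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical scores of six pupils on the same test in January and in June.
january = [40, 35, 50, 28, 45, 38]
june = [42, 33, 49, 30, 47, 36]
print(round(pearson_r(january, june), 2))
```

A coefficient near 1.00 indicates that pupils kept nearly the same relative standing on both occasions, as they would under a steel ruler; a rubber ruler test would yield a markedly lower figure.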

EFFICIENCY

Efficiency, as you know, means that the most output is accomplished with
the least input; and you know also that it is a criterion for many human
activities other than measurement. With reference to measurement, efficiency
means that a given procedure provides the needed measurement symbols in
minimum time and with minimum expense and energy when compared with
other possible procedures. Even more than validity and reliability, it is a
function of specific types of procedure; you are referred to the treatments of
these procedures for means of increasing their efficiency.

It should go without saying that time and energy saved in measurement 
at the expense of reliability and/or validity is actually time wasted. However, 
the relationship between reliability and validity and the duration of a measure- 
ment procedure is curvilinear and, after a point, large increases in the length 
of a test or other procedure result in only slight increases in reliability.* 

* See page 114 in Chapter 6 for an analysis of the relationship between length of a 
test and its reliability. 





FUNDAMENTAL CONCEPTIONS AND PROCEDURES 


Moreover, the most careful and time-consuming revision of an already rea- 
sonably valid test can be expected to yield only slight gains in the test’s validity. 
The curve of Figure 2 suggests the relationship likely to exist between re- 
liability or validity and time or energy 
devoted to a measurement. 
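
The diminishing return described here can be illustrated, for reliability and test length only, with the Spearman-Brown formula. The text does not name that formula at this point, so its use is an assumption, as are the function name and the starting coefficient of .60; the formula also presumes that the added items are comparable to the original ones.

```python
def lengthened_reliability(r, k):
    """Spearman-Brown estimate of reliability for a test made k times as long.

    r is the reliability of the original test (between 0 and 1);
    k is the factor by which the test is lengthened (k > 0).
    """
    if not 0.0 < r < 1.0:
        raise ValueError("r must lie between 0 and 1")
    if k <= 0:
        raise ValueError("k must be positive")
    return k * r / (1 + (k - 1) * r)


base = 0.60                      # hypothetical reliability of the short test
factors = [1, 2, 3, 4, 5, 6]     # test made 1x, 2x, ... 6x as long
estimates = [lengthened_reliability(base, k) for k in factors]
# Gain in reliability bought by each further lengthening of the test.
gains = [later - earlier for earlier, later in zip(estimates, estimates[1:])]

for k, estimate in zip(factors, estimates):
    print(f"{k} x length: estimated reliability {estimate:.3f}")
```

Each successive lengthening buys a smaller gain than the one before it, which is the general shape Figure 2 is meant to suggest.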

Thus, when we measure, we need to examine the purpose of our meas- 
urement and the degree of accuracy necessary. When our procedures 
achieve this degree of accuracy, their extension or additional time spent on 
improving them may not be profitable. For example, machinists must 
use micrometers, which are accurate to the ten-thousandth of an inch; but 
carpenters need only a tape measure whose smallest unit is one sixteenth 
of an inch; and those who mark distances on highway signs need only to use 
odometers whose finest reading is one tenth of a mile. In general, the greatest 
degree of precision is necessary for research purposes, the next for decisions 
and predictions about individual pupils, and the least for decisions and 
predictions about groups of pupils. 

Some Technical Terminology of Measuring Procedures † 

As behavioral measurement has been developed by teachers and psy- 
chologists over the last fifty years, a number of terms have taken on special, 
technical meaning for the procedures of measurement. Several of these terms 
are significant for all types of procedures and need to be understood if de- 
scriptions of the procedures are to be meaningful. 

Instrument. Any tangible device used to assign measurement symbols 
to phenomena. A test, rating form, record sheet, check list, test battery, etc., 
are all called instruments. 

Test. A special type of measuring instrument. Its general characteristic 
is that it forces responses from a pupil and the responses are considered to 
be indicative of the pupil’s skill, knowledge, attitudes, etc. True-false tests, 
essay examinations, attitude scales, short-answer tests, mid-terms and finals, 
and a personality inventory are all technically to be called tests. 

Item. A part of a test that elicits a specific response. Examples of an 
item are a multiple-choice question, a problem on a mathematics test, a true- 
false question, and a picture in a projective personality test. 

† In Appendix A is an extensive glossary of terms connected with educational 
measurement and evaluation. 

Figure 2. Probable relationship between the reliability or validity of a meas- 
urement procedure and the time and energy devoted to the procedure. (This 
curve is illustrative only and has no precise mathematical significance.) 

Standardized. The instrument (usually a test) so designated has pre- 
viously been administered to a population with known characteristics. A 
pupil’s score may then be interpreted with reference to the scores made by 
the group of pupils upon whom the test was standardized. This group is called 
the “standardization sample.” For example, the Davis-Eells Games were ad- 
ministered to approximately 3,000 pupils per grade in order to determine the 
significance of different scores. 

Norm. The score on a standardized test that is typical of a given age or 
grade of pupils or of some segment of the age or grade population. In the test 
cited above, a score of 41 is the average score for pupils nine and a half to 
ten years old. A score of 51 represents the 91st percentile rank among children 
of this age. Both the average score and the 91st percentile score are norms. 
The derivation of norms involves statistical techniques and these are explained 
in Chapter 7. 
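
The two kinds of norm just mentioned can be sketched as follows. The derivations the authors have in mind are those of Chapter 7 and are not reproduced here; this sketch assumes one common convention, in which a percentile rank is the percentage of the standardization sample scoring below a given score. The function names and the sample scores are hypothetical.

```python
def average_norm(sample):
    """Average score of the standardization sample."""
    if not sample:
        raise ValueError("the standardization sample is empty")
    return sum(sample) / len(sample)


def percentile_rank(score, sample):
    """Percentage of the standardization sample scoring below `score`.

    Assumed convention: only scores strictly below the given score count.
    """
    if not sample:
        raise ValueError("the standardization sample is empty")
    below = sum(1 for s in sample if s < score)
    return 100.0 * below / len(sample)


# Hypothetical standardization sample for one age group (illustrative only).
sample = [29, 33, 35, 37, 38, 40, 41, 41, 42, 43, 44, 45, 46, 47, 48, 49,
          50, 50, 52, 56]
print("average score:", average_norm(sample))
print("percentile rank of a score of 51:", percentile_rank(51, sample))
```

A pupil's raw score takes on meaning only when it is placed against such a sample: the same score may be average for one age group and exceptional for another.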

Bias. The tendency of an instrument to favor a certain outcome or to 
discriminate for or against pupils with certain characteristics. Most paper 
and pencil tests designed for use with groups are biased in favor of good 
readers. That is, if two pupils have equivalent knowledge, the one who is the 
better reader is likely to make the higher score. Many questionnaires dis- 
tributed by political organizations are so constructed that the replies in- 
evitably favor the policies of the political organization. 

Criterion. That with which a procedure and/or instrument is compared 
to determine its validity. The ratings given by clinical psychologists and psy- 
chiatrists to a group of people often constitute the criterion of validity for a 
personality test. In intelligence testing the Revised Stanford-Binet Scale 
(Appendix B) for many years has been the criterion for intelligence tests. The 
degree to which a newly devised test produced results similar to those pro- 
duced by the Binet was offered as evidence of the new test’s validity. 

Scoring. Checking test responses against a key to determine their cor- 
rectness or incorrectness, or to categorize them in some other way; assigning 
numbers or other symbols to pupil products or free expressions so as to rate 
them. 

Summary 

The function of behavioral measuring procedures is to assign the proper 
measurement symbols, usually numerals, to dimensions of behavioral phe- 
nomena. To accomplish this the procedures employ standard stimulations, 
elicit standard differential responses, use standard analysis systems, and engage 
in sampling. All procedures do not employ all the devices, but all involve sam- 
pling. Samples should be reasonably large and representative both categorically 
and temporally for accurate measurement. 

Procedures in use may be classified as observation, product analysis, free 
response tests, and guided response tests. Three criteria are applicable to 
determining their effectiveness: validity, reliability, and efficiency. Validity 
usually is defined as the capability of a procedure to measure what it purports 

Davis, Allison, and Eells, Kenneth. Davis-Eells Games. New York: World Book Co. 



CHAPTER 4 


OBSERVATION 


The most primitive of man’s procedures for measurement is simple ob- 
servation. The nomad hunter looked at tracks, listened to animal cries, felt the 
breeze against his cheek, watched the moon and stars, all to determine what 
game were present, what the weather was likely to be, and what season was 
approaching. Yet, two of the most important of civilized man’s vocations, 
medicine and psychology, employ observation as their principal technique 
of measurement. For all his meter readings and laboratory analysis, it is 
largely what the physician’s eyes see, and what his probing hands feel, that 
tell him what your illness is and how serious it is. And, although the psy- 
chological clinician may use a Rorschach test and even an electroencephalo- 
gram, his diagnoses of the neurotic’s phobias and the psychotic’s delusions 
depend heavily upon what he observes of the patients’ talk, gestures, and 
facial expressions. 

We shall define observation as measurement without instruments, or, if 
instruments are used, measurement in which they affect the measurer and not 
the measured. In observation procedures, the measurer applies his senses 
directly to the phenomenon being measured. He looks at the behavior 
(studying); he does not 
look at a score on a test that is indicative of the behavior (a score on a test 
of study habits). Because, in observation, we apprehend the dimensions of 
phenomena directly, observation, of all procedures, is most susceptible to 
errors of faulty perception and bias. For this reason, it is a primitive technique 
often despised by the scientist for its unreliability. But, just because dimen- 
sions are apprehended directly, all types of sensory data can be handled 
simultaneously and in proper relationship. Moreover, the pattern of a phe- 
nomenon can be seen as well as its elements. Thus, observation can be a highly 
civilized procedure, valued for its validity by the same scientist who decries 
its unreliability. 

In education, observation is the most widely used of all measurement 
procedures. Principally, this is because many behavioral phenomena may 
not be assessed validly by any other procedure. In succeeding paragraphs we 
will explain some of the factors that affect the usefulness of observations, 
describe several observation techniques, state the forms of measurement that 
observations may be expected to yield, and suggest the types of educational 
phenomena for which observation is a legitimate measuring procedure. 



Standard Analysis and Adequate Sampling in Observation 

In observation, an attempt is made to appraise whatever happens, as it 
happens. Consequently, neither standard stimulations nor standard responses 
may be utilized to make the measurements more accurate. The observer must 
try to make his observations competent just through proper use of a standard 
analysis system and through adequate sampling. 

To insure that the same measurement symbol is assigned to phenomena 
having the same characteristics, you may recall that it is necessary to have 
a standardized way of reacting to phenomena. In observation, this standardized 
reaction is especially difficult to attain. For one thing, what is to be observed 
is entirely a function of the observer. There is no test or other instrument 
present to force a given desired behavior. What behaviors are occurring is 
largely happenchance, and what the observer sees and hears of what does 
occur is entirely up to him. For example, a teacher may wish to observe Bill’s 
work habits during the fourth period. Bill may or may not work during the 
fourth period. If he does work, it is still up to the teacher to keep his attention 
on Bill, to attend to Bill’s actions rather than Bill’s appearance, and to watch 
Bill’s work-oriented actions and not his socially oriented actions. Then, the 
teacher may wish to observe Nancy’s work habits, so as to compare them 
with Bill’s. Again, the same disciplined perception is required, and, in addi- 
tion, the observation of Nancy must be like that of Bill. 

A second deterrent to standard analysis in observation is the fact that the 
observer necessarily is aware of his own feelings as well as of external happen- 
ings. It is as if a thermometer were at the same time a wind gauge yet seemed 
only to measure the temperature. As the wind increased and decreased, the 
thermometer would show changes in temperature even though the actual tem- 
perature perhaps remained constant. Just so, a teacher needs to observe and 
rate the “citizenship” of pupils but finds it difficult to keep his stomach, 
muscles, and heart out of the picture. When he is well-fed, rested, and happy, 
the pupils may be good citizens. When he is hungry, or dyspeptic, tired, or sad, 
they are more apt to be considered hellions. Yet the pupils may not have 
changed at all. 

The problem of sampling is less complex for observation than for some 
other procedures, but it is no less critical. So far as temporal sampling is con- 
cerned, it is known that children and youth fluctuate widely over relatively 
short periods of time in their motivation, achievement, interests, and nearly 
everything of educational significance. Uniformity there is, but it is the uni- 
formity of a mountain range rather than that of a plain. Yet, the teacher 
frequently is called upon to evaluate motivation, achievement, and interest as 
if they were always the same for a given pupil. It is little wonder that his judg- 
ments are sometimes in error when, as too often happens, they are based on 
only a few observations of the pupil. 

Categorical sampling (representing properly the different elements of the 
phenomenon) is even more of a problem in observation. We know that a 
child’s progress in speaking properly involves his improvement in enunciation, 
rate, tone, vocabulary, usage, etc. If the speech of pupils in a ninth-grade class 
is to be measured accurately, the same attention should be given to each pupil 
with regard to each of the dimensions. If some are heard only in situations 
that involve easily pronounced words and some in situations involving difficult 
words, if some are observed in recitations where there is little opportunity for 
tone variation and so on, categorical sampling obviously is inadequate. 

Some devices are presented shortly (beginning on page 52) that will facilitate 
standard analysis and representative sampling. However, you will find that 
the observer’s skill and concentration are more critical than his technique. 
It is necessary to know the significance of standard analysis and representative 
sampling, and then consciously to practice them, if you wish to be a com- 
petent observer. 

Forms of Measurement Appropriate to Observation 

Historically, observation has been a means to measurement by description, 
classification, and ranking, but rarely to scale measurement. Teachers in both 
ancient and modern times have watched pupils and have labeled them good, 
fair, or poor in achievement; lazy or industrious in study; and well-behaved 
or unruly in deportment. Then, again, teachers have listened to speeches and 
ranked them 1, 2, 3. They have designated Henry as third in his class, and 
Marie as seventh. But, with a few exceptions, they have not said a pupil is 
eleven years and six months (a scale number) on the basis of observation 
alone. Nor have they assigned him an IQ (another scale score) nor indicated 
how many seconds it took him to run a race from simple observation only. 

Similarly, except where counting is involved, current observation pro- 
cedures may be expected to yield only descriptive and classificatory symbols 
and rank numbers. The descriptions produced by observations are discussed 
under check lists and anecdotal forms (pages 52-54). Classification sym- 
bols utilized in observation range from the simple dichotomies of satisfactory- 
unsatisfactory, obedient-disobedient, to the many-valued classificatory 
systems of certain rating scales (see pages 56-57). Where numbers that 
do not indicate a count are recorded as the result of observation, they usually 
represent verbally defined classifications or rank within a group. It is possible 
to derive scale numbers from observations, but the techniques are so time- 
consuming as to be justified only in research. Hence, the significance and 
accuracy of observational measurements usually are limited by the significance 
and precision of descriptive, classificatory, and rank symbols. 

Evaluation as a Direct Product of Observation 

Observations frequently produce evaluations rather than measurements. 
When, as a result of observation, a tenth-grade pupil is given a grade of B, a 
beginner’s performance on the piano is awarded a certificate of merit, or an 
emotionally disturbed child is placed in a mental institution, an evaluation 
has been made, not a measurement. In the introduction to Section I, we 
differentiated between characterizing the status of phenomena (measurement) 
and judging the value of phenomena (evaluation). The latter, you may recall, 
involves assigning a value symbol to something, usually by comparing its 
status with some standard of value. Such a comparison between status and 
standard was made in the case of the grade, the certificate of merit, and the 
institutional placement. Probably, the teacher had in mind several levels of 
achievement for tenth graders and he thought that the pupil’s achievement 
matched the level that was worth a B. The piano teacher knew about what to 
expect of beginners. The beginner appeared to live up to expectation and, 
hence, was judged to deserve the certificate that symbolized that quality of 
piano playing. And the psychiatrist who judged that the child needed confine- 
ment and treatment possibly did so because the child’s behavior seemed to be 
similar to that of a classification of symptoms called schizophrenia and hence 
was evaluated as requiring hospital treatment. 

Evaluations based on observation are often subjective. The standard used 
by the tenth-grade teacher was in his own mind, as was that of the music 
teacher and the psychiatrist. Now, as we shall see in Chapter 9, evaluations 
tend to lose validity when they are made subjectively. Moreover, when the 
evaluation of a phenomenon is the only thing expressed, it is not possible to 
verify the measurement on which the evaluation is based or even to ascertain 
that any separate act of measurement occurred. Thus, the tendency of observa- 
tion to produce evaluations directly and immediately may produce errors in 
measurement, in evaluation, or in both. 

To guard against such errors it usually is advisable first to observe and 
characterize the actual status of the thing in question and then to compare 
this status with a standard, keeping both steps separate and explicit. The 
standard to be applied should, moreover, be written down or given objective 
form in some way. If, for some reason, the measurement aspect of observation 
may not be explicit, it may at least be kept separate from the evaluation. This 
requires that you, as an observer, must know when you are seeing the char- 
acteristics of a pupil, and when you are judging what you have seen. It re- 
quires, moreover, that you must consciously perform the measurement first 
and the evaluation second. 
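
The separation of the two steps can be pictured with a small sketch. Everything in it is hypothetical: the `Measurement` record, the push-up count used as the observed status, and the cut-off in the written standard are assumptions chosen only to show a measurement being recorded first and a written standard being applied to it afterward.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """Step 1: what was observed, recorded without any judgment of value."""
    pupil: str
    dimension: str
    status: int  # e.g., a count of push-ups actually seen


# Step 2 uses a standard that is written down, not carried in the observer's
# head. Each entry is (minimum status, value symbol), highest minimum first.
# The cut-off of 10 is a hypothetical figure.
WRITTEN_STANDARD = [(10, "satisfactory"), (0, "unsatisfactory")]


def evaluate(measurement, standard):
    """Compare a recorded status with the written standard."""
    for minimum, value_symbol in standard:
        if measurement.status >= minimum:
            return value_symbol
    raise ValueError("status falls below every level of the standard")


observed = Measurement(pupil="Bill", dimension="push-ups", status=3)
print(observed)                               # the measurement stands alone
print(evaluate(observed, WRITTEN_STANDARD))   # the evaluation comes second
```

Because the measurement is kept as its own record, another person can verify it, or apply a different standard to it, without repeating the observation.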

Evaluations that derive immediately from observation often are called 
ratings, and the devices used for such evaluations usually are called rating 
scales. These are discussed on page 55. The primary functions of evaluative 
rating scales (some are simply measuring devices) are to give more objectivity 
and clarity to standards and to insure more systematic comparisons between 
phenomena and standard. Thus, their use serves to increase the validity of 
observational evaluations. 





Devices Used in Observation 

A number of verbal and graphic devices are used in observation. In gen- 
eral, these are designed to insure standard analysis and adequate sampling, to 
militate against too quick and too subjective evaluations, and to predetermine 
the forms in which measurements are to be expressed. 

CHECK LISTS 

A check list is any sort of device, simple or complicated, which contains 
the items to be observed and, perhaps, space for number or short verbal 
entries. The check list in Figure 3 is an example of one that a physical educa- 
tion instructor might use. Other check lists provide for an indication of the 
presence or absence of certain factors. These are employed frequently in 
surveys of school plants and facilities. One that could be applied to the audio- 
visual services of a school might contain such items as: Central storage of 
equipment ____, Storage in classrooms ____, Teachers operate ____, 
Students operate ____. Checks would be placed by the items that obtain 
for a given school. 


Name ________________   Grade ____   Age ____   Date ________ 

1. Type of activity 
2. Interest 
3. Effort 
4. Co-ordination 
5. Posture 
6. Skill in activity 
7. Sportsmanship 
8. Other factors 

Figure 3. Check list for observation in physical education. 


Check lists sometimes are called observation schedules, particularly when 
they are lengthy. They are employed extensively in Homemaking, in Industrial 
Arts and Agriculture, in Physical Education, in Art, and in all other school 
areas where observation is an important procedure of measurement. They help 
make observations more reliable by stimulating the observer to look for the 
same factors or dimensions each time he observes. They do not eliminate bias, 
nor do they insure adequate time sampling or uniform recording. 

Few check lists are available commercially, and in most subjects, for most 
purposes, they are not especially desirable. Published check lists are likely to 
be no more reliable than self-constructed ones, and they are likely to be less 
relevant to your measurement task. To design one, enumerate the dimensions 
you wish to observe, define them clearly, eliminate those that are vague or 
repetitive, arrange them on a sheet of paper in whatever manner is most 
convenient for observing and recording, include space for identifying data, 
and try out the form. When it seems workable, duplicate a small supply. When 
you have used up this pilot batch, revise the check list on the basis of your 
experience, and run off the quantity you will need. 
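
For readers who keep records electronically, the design steps above might be sketched as follows. The dimensions are those of the physical education check list in Figure 3; the function names, the dictionary layout, and the sample entry are hypothetical.

```python
# Dimensions enumerated once, so every observation covers the same factors.
DIMENSIONS = (
    "Type of activity", "Interest", "Effort", "Co-ordination",
    "Posture", "Skill in activity", "Sportsmanship", "Other factors",
)


def blank_check_list(name, grade, age, date):
    """Return an empty form with space for identifying data and entries."""
    return {
        "identifying": {"name": name, "grade": grade, "age": age, "date": date},
        "entries": {dimension: "" for dimension in DIMENSIONS},
    }


def record(check_list, dimension, entry):
    """Enter a number or short verbal note for one listed dimension."""
    if dimension not in check_list["entries"]:
        raise KeyError(f"{dimension!r} is not on this check list")
    check_list["entries"][dimension] = entry


def unobserved(check_list):
    """List the dimensions still lacking an entry."""
    return [d for d, entry in check_list["entries"].items() if not entry]


form = blank_check_list("Bill", "9", 14, "January 15")
record(form, "Effort", "ran every lap without stopping")
print(unobserved(form))  # the factors the observer has yet to look for
```

The list of unobserved dimensions serves the same purpose as the printed form: it reminds the observer to look for the same factors each time.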

ANECDOTAL FORMS 

An anecdotal record is a check list that provides space for much writing, 
and, as a rule, provides for less breakdown of dimensions than a check list. 
Often the forms are used over a period of time and are meant to be cumulative. 
Their significance is primarily to minimize the use of high-level abstractions in 
recording and to obtain instead an “operational description” of behavior. An 
anecdotal record designed to serve much the same purpose as the physical 
education check list in Figure 3 is shown in Figure 4. 

Name ________________ 

Directions 

In the space provided, record observations that bear on the individual’s 
physical development and social development. Do not evaluate, but describe. 
Avoid vague words such as good, strong, shy, etc. Enter statements of what 
happened, or what you saw, as “Did three push-ups, and couldn’t do any 
more,” “Cried and started fighting when he was called out.” Date each entry. 

Physical Development: 


Social Development: 


Figure 4. Anecdotal record for physical education. 

Purely as paper-and-pencil aids, anecdotal forms do little more than do 
blank sheets of paper to insure validity and reliability in observation. As a 
viewpoint about recording observations and as a means to enforcing that view- 
point, they have considerable merit. The viewpoint is that observation records 
are valid and reliable to the extent that they actually reproduce whatever was 
observed. The best anecdotal record is a sound motion picture. In practice 
this ideal is not achieved. What can be achieved is a relatively unambiguous 
record of things that a teacher noticed and thought important about a pupil’s 
behavior. Proper use of anecdotal recording tends to make for more adequate 
temporal sampling, but categorical sampling often suffers since the observer 
is not recurrently reminded of what to observe. 

Anecdotal recording is used most frequently by elementary teachers and 
in courses where citizenship or social development is an express goal. In 
addition, counselors, school psychologists, social workers, and supervisors 
make extensive use of anecdotal records. As with check lists, there are rela- 
tively few published forms, and self-designed anecdotal forms are generally 
more satisfactory. The construction of the form is even less difficult than the 
construction of a check list, but to record properly is somewhat more difficult. 

Proper anecdotal recording is characterized by the following: 

1. What is written down is what was seen or heard. Inferences, guesses, 
assumptions, are omitted unless they are clearly labeled as inferences, guesses, 
or assumptions. 

2. The observer has determined what aspects of behavior are related to 
the dimension being appraised. He observes these only and records these only. 

3. If the record is to be cumulative, a plan of periodic observation and 
recording is established and adhered to. 

4. Words and phrases are used whose meaning is clear, and, so far as 
possible, unequivocal. 

5. Words and phrases are employed that are definable in terms of things 
rather than other words. Concrete statements are preferred to abstract ones. 
For example, “He became pale and his hands trembled,” not “He was dis- 
turbed.” 

6. Words and phrases that have strong emotional connotations are avoided, 
i.e., love, hate, insolent, courteous, loyal, dishonest, etc. 

7. Words and phrases are avoided which express the observer’s judgment, 
or his opinion, and not just his perception. Among the frequently encountered 
“judgmental” terms that should be avoided are these: 

a. well-behaved        e. industrious 
b. delinquent          f. nervous 
c. aggressive          g. happy 
d. didn’t try 

RATING SCALES 

As observational aids, rating scales are at the opposite end of the spectrum 
from anecdotal forms. Recording is brief and formalized but much is printed 
on the form, whereas anecdotal forms provide for extensive and informal 
recording and contain far more blank space than printing. The types of rating 
scale are legion and the example items shown in Figure 5 can only suggest their 
myriad forms. 

Measurement Scales and Evaluative Scales. Of the rating scale items 
shown in Figure 5, B, D, and F are primarily devices for recording status 
(measurement) while A, C, and G seem to be means of expressing judgments 
(evaluation). In B, D, and F you may observe that the scale merely covers 
the possibilities of the dimension in question. Any high or low value placed 
on a given rating is a matter of implication or for later determination. In A, 
C, and G, on the other hand, the observer clearly is expected to evaluate the 
pupil on the dimensions in question. Not only is he to observe what the pupil’s 
condition is, but he also is to compare this status with a standard and thus 
judge the pupil: “satisfactory” in study habits, a “poor” performer on the 
playground, possessed of the “best possible” attitude toward school. 

Scale Including an Evaluative Standard. Item E illustrates a rating scale 
that is a measurement device but has an evaluative standard superimposed 
upon it. Where the standard and the measurement scale are related properly, 
the device has great merit. Subjectivity in evaluation is eliminated for the 
observer. He need concentrate only on finding the point on the scale that best 
expresses a pupil’s status (in posture in Example E) and the evaluation is 
made automatically. The propriety of the evaluation is then not a function 
of the observer, but of the rationale underlying the scale. In the case of Ex- 
ample E, physicians’ opinions as to the relationship between different postures 
and health presumably are the basis for the evaluative scale, and evaluations 
are valid to the extent that these opinions are valid. 

Category Scales. Rating scales sometimes ask for phenomena to be 
assigned to categories and sometimes to be placed along a continuum. Ex- 
amples A, B, C, and E are categorical rating scales. In each instance a pupil 
is to be given a definite classification, S or U; Always, Usually, or Seldom; 
etc. In such ratings, only gross differences between pupils are appraised, and 
tacitly it is assumed that all the pupils assigned a given categorical rating are 
alike in the dimension being appraised. 

Continuum Scales. Continuum scales are exemplified by items D, F, and 
G, in Figure 5. Example F is a pure continuum. A line is shown to represent 
the variation between two extremes. A pupil’s status is to be indicated by 
checking any point on the line, and, in theory, each pupil might be checked at 
a different point, and thus the unique status of each pupil is retained in the 
rating. Modified continuums are shown in D and G. The rationale for these 
two scales is the same as for F; extremities of status are established and degrees 
of variation between the extremities are represented. Rather than unbroken 
lines, however, numbers and phrases are shown at intervals along the line, and 
the observer’s task is to choose the appropriate number or phrase. These two 
continuum scales are analogous to a ruler that contains inches, but no frac- 
tions of inches. Distance measured by the ruler may be read to the nearest 
inch. Variation as to persistence may be marked to the nearest phrase, or 
variation as to attitude toward school to the nearest number. If the observer 
wishes, he may interpolate between phrases and numbers, just as with a ruler 
you may estimate 3.5 inches, when only 3- and 4-inch intervals are marked off. 

Graphic or Descriptive Scales. Items D and F in Figure 5 illustrate two 
other possible characteristics of rating scales. In D each phrase is given a 
number value and, though a pupil is rated according to his approximation to 
a phrase, his rating is recorded as a number. Such use of corresponding num- 
bers and phrases is widespread. These scales usually are called graphic rating 
scales. 


(A) (Write a letter) 

Rate the child on each trait listed as satisfactory or unsatisfactory 
by putting S or U in the space provided. 

Honesty ____          Obedience ____ 
Neatness ____         Study Habits ____ 


(B) (Check a column) 

                              Always    Usually    Seldom 
Keeps his desk, books, 
and other materials            ____      ____       ____ 
clean and neat. 


(C) (Write a number) 

Evaluate the pupil’s behavior in regard to each factor by 
placing 1, 2, or 3 in the box by the factor. 

1. like the best pupil’s behavior 
2. like the average pupil’s behavior 
3. like the poorest pupil’s behavior 

Playground activity □ 
Going to and from school □ 
Attitude toward teachers □ 


(D) (Choose a phrase) 

Is he easily discouraged or is he persistent? 

Melts before slight obstacles or objections (5) 
Gives up before adequate trial (3) 
Gives everything a fair trial (1) 
Persists until convinced of a mistake (2) 
Never gives in, obstinate (4) 


(E) (Choose a picture) 

POSTURE STANDARDS * 
Intermediate-Type Boys 

[Pictures of postures labeled:]  Excellent    Good    Poor 


(F) (Check a point on a line) 

Moral Standards 

Extremely moral,                              No apparent self-standards, 
hyperactive          ____________________     always follows the crowd 
conscience                                    or expediency 


(G) (Check a number on a line) 

Attitude toward school 

1       2       3       4       5       6       7 
Worst possible       Average       Best possible 


Figure 5. Types of rating scale items. 

* Adapted from Armin Klein and Leah C. Thomas, Posture Exercises, Children’s 
Bureau Publication 165. Washington, D.C.: U.S. Department of Health, Education and 
Welfare, 1926, p. 12. 

Scales With Unequal Intervals. In Item G you may notice that numbers 
are unevenly spaced; 4, the center number, represents an interval four times as 
great as 1 or 7, the numbers at the extremes. The use of unequal intervals has 
both a psychological and a statistical basis. It is known that it is more difficult 
to make valid distinctions among the large group of pupils we call average or 
normal than it is to note differences among those children who deviate from 
the average to an extreme degree. A G-type scale compensates for this condi- 
tion by spreading the size of the middle intervals and constricting the size of 
the extremity intervals. From a statistical point of view, a G-type scale forces 
the ratings of large numbers of pupils to approximate a normal distribution 
(see Chapter 8, pages 161-169) and thus to approximate the assumed real 
distribution for many human dimensions. 
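
The statistical effect of unequal intervals can be sketched numerically. Only one fact below comes from the description of Item G: the interval for 4 is four times as great as that for 1 or 7. The widths given to 2, 3, 5, and 6 are assumed for illustration, as is the simplification that checks fall evenly along the whole line.

```python
# Relative width of the interval belonging to each number on a G-type scale.
# Widths for 1, 4, and 7 follow the text; the others are assumed.
INTERVAL_WIDTHS = {1: 1, 2: 2, 3: 3, 4: 4, 5: 3, 6: 2, 7: 1}


def expected_shares(widths):
    """Share of ratings per number if checks fall evenly along the line."""
    total = sum(widths.values())
    return {number: width / total for number, width in widths.items()}


shares = expected_shares(INTERVAL_WIDTHS)
for number, share in shares.items():
    # A crude bar chart: one mark for each sixteenth of the line.
    print(number, "#" * INTERVAL_WIDTHS[number], f"{share:.1%}")
```

Under these assumptions most ratings pile up at the center and few fall at the extremes, which is the humped shape the authors have in mind when they speak of approximating a normal distribution.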

Measurement Symbols Derived from Scales. With the use of continuum
rating scales, it is possible to obtain numbers from which rank or order symbols
may be derived. Obviously, if one pupil is given a rating of 3 and two
others ratings of 4 and 5, each is asserted to have more or less of a dimension
than the others, and consequently the rank of each of the three is apparent
from their ratings.
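
Deriving rank symbols from ratings is a mechanical step, sketched below. The
sketch assumes that a higher rating means more of the dimension and that pupils
with the same rating share the mean of the ranks they jointly occupy; neither
convention is stated in the text, and the function name is hypothetical.

```python
def ranks_from_ratings(ratings):
    """Map each pupil to a rank (1 = highest rating); ties share the mean rank."""
    ordered = sorted(ratings.items(), key=lambda item: item[1], reverse=True)
    ranks = {}
    position = 0
    while position < len(ordered):
        # Extend `end` over every pupil tied with the one at `position`.
        end = position
        while end + 1 < len(ordered) and ordered[end + 1][1] == ordered[position][1]:
            end += 1
        shared_rank = (position + 1 + end + 1) / 2
        for pupil, _ in ordered[position:end + 1]:
            ranks[pupil] = shared_rank
        position = end + 1
    return ranks


if __name__ == "__main__":
    # The three pupils of the paragraph above, with ratings of 3, 4, and 5.
    print(ranks_from_ratings({"first pupil": 3, "second pupil": 4, "third pupil": 5}))
```

Note that the ranks say only who stands above whom; they carry no information
about how far apart the pupils are.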

It is not possible as a rule to attribute scale position to the numbers of
rating scales, even though they are called “scales.” You may recall (Chapter
1, pages 6-8) that the numbers of a valid scale represent known and
usually equal intervals, and that the numbers count from a zero or other fixed
and known reference point. In our observational rating “scales” it is apparent
that these conditions are not met.¹ Whether or not line subdivisions and/or
numbers represent equivalent increments of difference is entirely a matter of
conjecture. Moreover, neither the center nor the ends of the usual rating scale
may be considered to have fixed values.

Number of Categories or Subdivisions in Scales. The number of categories
or of subdivisions optimum for a rating scale is indeterminate. The
usual number of intervals is five, but there is no rational justification for this
number. Wrightstone, reviewing research in rating methods, asserts that seven
is an optimum number for rating human traits (9:962). Rating scales with
more than ten units are unusual, and a two-unit scale (a dichotomous rating)
is, of course, the bottom limit. The principle to be followed in designing a
rating scale is that the number of scale intervals should approximate the number
of clearly discernible differences in the dimension being appraised. For
measurement tasks that require great precision a greater number of scale units
may be necessary. For tasks requiring less precision, fewer intervals are
permissible.

Sources of Scales. A large number of rating scales are published but, as
with check lists and anecdotal forms, they are not subject to standardization.
Many of the scales designed each year for research purposes are described in
professional journals and constitute a good source of ideas. Buros’ Mental
Measurements Yearbooks, publishers’ catalogues and reviews, and advertisements
in professional periodicals provide listings of the published scales.

¹ In research situations, rating scales have been and may be developed with
approximately equal intervals and with somewhat fixed points of reference. Scale
numbers can be derived from such “scales,” but they are to be interpreted with
caution. In ordinary school situations, rating “scales” are scales only in the
figurative sense of the term. Methods by which approximately equal intervals and
fixed reference points may be established are described in L. L. Thurstone, “The
Method of Paired Comparisons for Social Values,” Journal of Abnormal and Social
Psychology, 21:384-400, 1927, and J. P. Guilford, Psychometric Methods, New York:
McGraw-Hill Book Co., Inc., 1936.

Applications of Scales. Rating scales find their greatest use, of course,
in areas where measurement must rely largely on observational methods.
Hence, they are employed extensively in the appraisal of personality, social
behavior, and teaching competence. Curriculum evaluation frequently utilizes
rating scales, as do surveys of school housing and facilities. Report cards (see
Chapter 9) are in effect evaluative rating scales whose ratings are based on
many observations and/or testings in broad categories of achievement.

Criteria for Rating Scales. Certain generalizations are appropriate to the
proper design and efficient use of rating scales.

1. The dimensions to be rated need to be very clearly defined in terms of
what is to be observed when the dimension is rated. The dimensions, moreover,
need to be as distinct as possible, each one being observable and measurable
by itself and not overlapping with another. Only as a dimension can be
clearly and discretely defined in operational terms can it be rated with any
validity. Thus, enunciation is a dimension of speech that can be rated effectively,
while ratings of interest are nearly always suspect.

2. If several dimensions are to be rated for a given pupil, it is probable
that each rating will be influenced by the others and in the same direction.
This is called the “halo” effect. If an initial rating is high, others will tend to
be high; if low, the others will tend to be depressed. Another way of viewing
“halo” effect is that the observer has a feeling about a person as a whole, and
he then proceeds to record ratings that will justify this feeling. To hedge
against this type of error, you need to concentrate on each separate dimension
you rate and consciously “shut out” your feelings about other dimensions
or the person as a whole. Frequently rating scales are devised with the right
extreme of a scale sometimes having a low value and sometimes a high value
just to hamper the operation of a “halo” effect.

3. Some observers are found to rate low and others to rate high, no matter
who or what is being rated. If such proneness to over- or underrate is known,
the observer must take steps to correct it or provide for an automatic correction
after he has made his ratings. Lack of experience with the full range of
variation in the item being rated often is a cause of consistently high or low
ratings.

4. Other observers tend to avoid the extremes in their ratings. They
unconsciously use middle or average ratings unless the deviation is so great as
to force the use of a higher or lower index. And, even in this case, they will still
record ratings closer to the center than an unbiased observer would.
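
The “automatic correction” mentioned in item 3 is not specified further. One
possible form, offered only as a sketch, re-expresses each observer's ratings
relative to that observer's own average and spread, so that a habitually high
rater, a habitually low rater, and a rater who hugs the middle are put on a
common footing. The function name and the sample ratings are hypothetical, and
the method assumes that every observer rated a comparable group of pupils.

```python
from statistics import mean, pstdev


def standardize_by_rater(ratings):
    """Convert each rater's ratings to deviations from that rater's own mean,
    measured in units of that rater's own standard deviation.

    `ratings` maps rater -> {pupil: numeric rating}.
    """
    adjusted = {}
    for rater, by_pupil in ratings.items():
        values = list(by_pupil.values())
        center = mean(values)
        spread = pstdev(values)
        adjusted[rater] = {
            # A rater who gives everyone the same rating conveys no distinctions.
            pupil: 0.0 if spread == 0 else (value - center) / spread
            for pupil, value in by_pupil.items()
        }
    return adjusted


if __name__ == "__main__":
    sample = {
        "lenient observer": {"Ann": 5, "Bob": 6, "Cy": 7},
        "severe observer": {"Ann": 1, "Bob": 2, "Cy": 3},
    }
    print(standardize_by_rater(sample))
```

In the sample, the two observers disagree by four points on every pupil, yet
after adjustment they are seen to agree completely on the pupils' relative
standing. The correction cannot, of course, repair a “halo” effect or a bias
toward particular pupils.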

Validity and Reliability in Observation 

Preceding paragraphs have dealt with the nature and limitations of
observation as a measuring procedure, and with the devices and techniques that
may facilitate observation. In this presentation a great many factors necessarily
have been stated that bear on the validity and reliability of particular
aspects of observation or of its use for particular purposes. Now, in conclusion,
we would like to present some validity-reliability maxims for observational
techniques as a whole.

Use Where and When Legitimate. Observation is justified when testing
or product analysis is too time-consuming, too expensive, or when the dimensions
to be measured cannot be measured validly except by observation. As
we have observed, reliability is likely to be lower for observation than for other
procedures and, consequently, it is justified only when more reliable procedures
are not equal to or available for the task. The school subjects and areas
in which observation is most likely to be an essential procedure of measurement
are Art, Music, Homemaking, Physical Education, Speech and Dramatics,
and citizenship, social adjustment, study habits and the like.

Use Appropriate Paper-and-Pencil Devices. The function of any
observational device, check list, anecdotal record, or rating scale, is simply to
insure the conditions of standard analysis and adequate sampling that observation
requires and, in addition, to mitigate some of the human errors
inherent in observation. Use of the device most appropriate to the given
observation task is mandatory. Appropriateness of a device is a function of the
form of measurement desired, of the purpose of the observation, and of the
phenomenon and/or dimensions being observed. Obviously, a check list or
an anecdotal form is needed for descriptive data and a rating scale of some
sort for classification or ranking. The purpose of individual diagnosis is better
served by an anecdotal form or a check list; and that of comparing pupils, by
a rating form. In general, phenomena with many complex and ill-defined
dimensions are better approached through anecdotal records, while check
lists and ratings may be used for those with simple and well-defined dimensions.

Record Quickly. It is axiomatic that good observation records or ratings
are made during the observation or immediately thereafter. Only by such quick
recording can you insure that the record is strictly a function of what was
observed. As time elapses, various types of distortion are likely to occur.² Detail
will be forgotten. Intervening experiences will interdict or become confused
with the conversation. Remembrance of what was observed will become
progressively more like the stereotype it approximates. Strong impressions will
become stronger, and weak ones weaker.

If the keeping of a record during a period of observation may invalidate
the observation, notes must be made immediately after. Even a wait of thirty
minutes or a few hours is likely to invalidate the record. As a rule, if a person
is likely to construe that your observation or its outcome has any effect on his
reputation or prestige, or even on your opinion of him, recording during the
observation is improper. You may proceed to record during the observation
only when you are sure that the fact of your recording is not an important
stimulus to the individual.

² The distortions that occur between an event and its recollection have been studied
most thoroughly by “field” psychologists. Their findings and generalizations are reviewed
in Ernest Hilgard, Theories of Learning, New York, Appleton-Century-Crofts, Inc., 1948.

Guard Against Bias. The existence of some preconception or feeling set
toward the person or thing observed is highly probable. It is advisable then
to assume that such bias exists, to determine what it is, and, consciously, to
try to keep it from affecting the description or rating you record. If the bias
is extreme, you should disqualify yourself as a competent observer, just as a
judge does when he has a special interest in a case being tried. One way of
verifying your control of your bias is to record as objectively as possible, and
at great length, and then to ask an impartial yet qualified third person to make
an independent rating on the basis of your record. Unless he agrees essentially
with your rating, it is well to question your own rating.

Base Evaluations on Several Observations if Possible. Obviously, a
judgment based on one measurement is less sure than one based on several
measurements, even when the measures are highly reliable. When measures are
less reliable, use of several measures as a basis for evaluation is even more
important. As we have seen, observation has the maximum likelihood for
unreliability of all the procedures of measurement. The number of observations
may be increased by pooling the observations of several observers as well as
by repeating your own observations. If several observers are used, it is well
to assure that all are competent and that all observe the same dimensions in
the same way.
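
How much pooling helps can be estimated with the Spearman-Brown formula, a
standard result that is not part of this chapter: if a single observation has
reliability r, the average of k comparable observations has reliability
kr / (1 + (k - 1)r). The sketch below assumes the observations are equally
reliable and made independently; the function name is hypothetical.

```python
def pooled_reliability(single_reliability, observations):
    """Spearman-Brown estimate for the mean of several comparable observations."""
    r, k = single_reliability, observations
    return (k * r) / (1 + (k - 1) * r)


if __name__ == "__main__":
    # A single observation with reliability 0.50, pooled over 1, 2, and 4 observers.
    for k in (1, 2, 4):
        print(k, round(pooled_reliability(0.50, k), 2))
```

With a single-observation reliability of 0.50, two observations give about 0.67
and four give 0.80, which is why pooling matters most for the least reliable
procedures.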

Summary 

A great deal of educational measurement is done with little or no assistance
from instruments. “Observation” is a collective term for the various methods
of this largely unaided measurement. In observation an attempt is made to
appraise whatever happens, as it happens. Consequently, only standard analysis
and sampling are employed, not standard stimulations nor standard responses.
What is observed is necessarily a function of the observer and thus
inattention and unwitting bias are two important difficulties in the process.

Observation may yield classificatory and descriptive symbols as well as
rank symbols but seldom may it produce scale numbers. Often, the immediate
result of observation is an evaluation, not a measurement. Several paper-and-pencil
devices are available for the recording of observations, for control of
the observer’s attention, and for protection against bias. Among these are
check lists of dimensions to be observed, anecdotal record forms, and various
types of rating scales.

The validity and reliability of observations are increased by the following
observances. Use observation as a measuring procedure where and when
legitimate only. Use appropriate paper-and-pencil devices and record quickly.
Guard against bias and whenever possible base evaluations on several
observations.





EXERCISES 

1. Along with several other students, observe a group of children at play
or at school work and make an anecdotal record of what you see and hear for
fifteen minutes. Compare your record with the records of your classmates. Note
all the differences among your records and discuss the reason for them as well
as their implications for reliability in observation.

2. Select some phase of a subject you teach or intend to teach for which
pupil evaluation might be based on observation. Prepare a check list, an anecdotal
form, or a rating scale for use with pupils on this phase of the subject.

3. Restricting yourself to a grade or to a subject, list the aspects of pupil
performance and achievement best measured or only measurable through
observation. Briefly justify the use of observation for each.

4. Reflect on and list the biases you have that might affect your evaluations
of pupil conduct or citizenship. Indicate how you might guard against the influence
of each of the biases you list.

5. Inspect the guidance folders of several elementary and secondary pupils.
Notice all the observational records and ratings contained in them. Write a brief
critique of the observational procedures based on the principles of valid
observation stated in this chapter.


BIBLIOGRAPHY

1. Buros, Oscar K., Fourth Mental Measurements Yearbook. Highland Park,
N.J.: Gryphon Press, 1953.

2. “Evaluating Pupil Progress,” California State Department of Education
Bulletin, XXI, No. 6, April, 1952.

3. Haggerty, M. E., Olson, W. C., and Wickman, E. K., Haggerty-Olson-Wickman
Behavior Rating Schedules. Yonkers: World Book Co., 1930.

4. Keisler, E. R., “An Improved Formula for Scoring Certain Guess-Who Ratings
at the Adolescent Level,” Journal of Educational Psychology, 45, March, 1954.

5. Klein, Arman, and Thomas, Leah C., Posture Exercises, Children's Bureau
Publication No. 165. Washington, D.C.: U.S. Department of Health, Education
& Welfare, 1929.

6. Seedorf, E. H., “Experimental Study in the Amount of Agreement among
Judges in Evaluating Oral Interpretation,” Journal of Educational Research,
43:10-21, 1949.

7. Sells, S. B., “Observational Methods of Research: Rating Scales,” Review of
Educational Research, 18:429, December, 1948.

8. Wilson, J. W., “Correlation of Clinical Estimates with Test Scores on Mental
Ability and Personality Tests,” Journal of Clinical Psychology, 10:97-99,
January, 1954.

9. Wrightstone, J. W., “Rating Methods,” in Encyclopedia of Educational
Research, ed. W. S. Monroe. New York: The Macmillan Co., 1950, pp. 961-964.



CHAPTER 5 

PRODUCT ANALYSIS AND FREE-RESPONSE
PROCEDURES


A great amount of educational measurement is based on appraisals of
products. Pupils write compositions, make drawings, prepare notebooks, and
design various artifacts in such classes as shop and homemaking, all as part
of their regular learning activity. Then teachers examine these products to
evaluate the pupil's achievement. Again, pupils are directed to produce things
specifically for purposes of evaluation: to write paragraphs or short compositions
on specified topics, to draw a paramecium as part of a biology test, or
to sketch a cross section of a typical volcano in a geography quiz. Such “test
products” constitute an additional important aspect of educational measurement.

For semantic convenience “test products” are called free responses in this
text. The more usual term for a test item that asks for an extended written
response is “essay question” or “essay examination.” This term has unfortunate
connotations of unreliability and, moreover, is too restrictive for our
scheme of classification. Consequently, the phrase “free-response” question
or item is preferred. As will be seen in Chapter 6, free response is to be
contrasted in our view with guided response, a generic heading for true-false,
multiple-choice items and the like.

In this chapter we shall treat product analysis and free-response procedures
jointly because of their inherent similarity as measuring procedures. In
turn, attention will be given to some of their general characteristics, to types
of free-response questions, to methods of scoring and forms of measurement
and, finally, to the applicability of the procedures.

General Characteristics of Product Analysis 
and Free-Response Techniques 

In Figure 6 are portrayed some representative products together with
scoring notes, and in Figure 7, examples are shown of free-response items and
their scoring. Both sets of examples have been selected to illustrate the varied
approaches to and problems inherent in such measuring procedures. Frequent
reference will be made to the Figures as our discussion progresses.



Products and their scoring

(A) [Illustration: a pupil's drawing]

Scoring: Over-all ranking. In assigning this a rank the teacher simply compared
the drawings of all his pupils, placed them in order of merit, and gave the
drawing the appropriate number. The ranking is entirely subjective but is based
on his opinion of such factors as composition, drawing skill, perspective, and
sympathy with the subject.

(B) [Illustration: a pupil's cartoon]

Scoring: Over-all rating. The teacher of this pupil in an eleventh-grade U.S.
history class rated his cartoon as an A. This means that the teacher had in mind
several categories of cartoons and the pupil’s had the essential characteristics
of the A category. These characteristics are a clear conception of a true and
important historical situation that is illustrated clearly, appropriately, and
interestingly.

(C) [Illustration: a pupil's written paper]

Scoring: Factor counting. The teacher was concerned with three types of error:
in spelling, in punctuation, and in syntax. The errors in the pupil’s paper were
classified and counted. Their kind and amount made the paper barely satisfactory,
so the pupil was given an S-.

(D) [Illustration: a dress]

Scoring: Factor rating. This dress, made by a Grade XII student, was rated on a
five-letter scale (A, B, C, D, F) for each of four factors, as follows:

    1. Suitability of pattern and material                          F
    2. Construction (neatness, accepted methods, pressing,
       time to complete, etc.)
    3. Appearance on individual                                     C
    4. Product as compared with girl's ability                      C-

Figure 6. Examples of pupil products scored by various methods.





Free-response items and their scoring

(E) Question: What is the difference between an electromagnet and a permanent
magnet? What are some uses of electromagnets?

Answer (points awarded shown in brackets):

    [2] An electromagnet is an iron core with a wire around it and an
        electric current goes through the wire.
    [2] The difference is an electromagnet can be turned on and off by
        electricity.
    [2] Some uses are a doorbell, crane,
    [1] motor.

Scoring: Factor counting with weights. Each specific and accurate characteristic
of electromagnets against permanent magnets was to receive 2 points. Each
accurate use was to receive 1 point up to a fixed maximum. The different
weighting for characteristics and uses indicates that knowledge of
characteristics is thought to be the more important.

(F) Question: What is the length and width of a rectangle whose length is 2
inches more than its width and whose perimeter is 40 inches?

Answer:

    1. Given: Length is 2 inches greater than width.
    2. To find: Let width = x
                Let length = x + 2 inches
    3. Conditions:
    4. Equations:
           2x + 2(x + 2) = 40
           2x + 2x + 4 = 40
           4x + 4 = 40
           4x = 36
           x = 9
           x + 2 = 11
    5. Solution: Width = 9 inches
                 Length = 11 inches

Scoring: Factor rating (as right or wrong). The purpose of this algebra problem
is to see whether pupils can use a given system for analyzing and solving a
problem. There are five steps in the system and the pupil is marked right or
wrong for each step. This pupil was in error on Step 1, omitting the perimeter of
the rectangle, and he omitted Step 3 entirely.

(H) Directions: Write the first two sentences of the Gettysburg Address in your
very best hand. Use pen and ink.

[Illustration: a pupil's handwriting specimen]

Scoring: Comparison with a product scale. The pupil's specimen was compared
with the several samples on the Ayres handwriting scale. His writing most
closely resembled that in the sample rated as 60. Hence his handwriting was
scored as 60.

Figure 7. Examples of free-response items scored by various methods.
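
The weighted factor counting described for the electromagnet item in Figure 7
can be sketched as follows. The two weights are those given in the scoring note;
the ceiling of 3 points for uses is a hypothetical value chosen only so the
example runs, and the function name is likewise an assumption. Deciding which
statements count as accurate remains the scorer's judgment and is simply taken
as given here.

```python
POINTS_PER_CHARACTERISTIC = 2  # weight stated in the scoring note
POINTS_PER_USE = 1             # weight stated in the scoring note
MAX_POINTS_FOR_USES = 3        # hypothetical ceiling, assumed for illustration


def score_answer(characteristics, uses):
    """Score a free response by counting accurate factors and weighting them."""
    characteristic_points = POINTS_PER_CHARACTERISTIC * len(characteristics)
    use_points = min(POINTS_PER_USE * len(uses), MAX_POINTS_FOR_USES)
    return characteristic_points + use_points


if __name__ == "__main__":
    pupil_score = score_answer(
        characteristics=[
            "iron core wound with wire carrying a current",
            "can be turned on and off",
        ],
        uses=["doorbell", "crane", "motor"],
    )
    print(pupil_score)
```

Writing the weights down before any paper is read is what turns an over-all
impression into a standard analysis: two scorers who agree on which statements
are accurate must then arrive at the same score.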





By their nature, neither products nor free-response items involve standardized
responses. In both instances the variety of possible responses is
virtually unlimited and automatic scoring thus is precluded. Validity and
reliability are to be gained through proper use of standard analysis, sampling
and standardized stimulations.

This matter of standardized stimulations is the only important difference
between product analysis and free-response techniques as measuring procedures.
Test questions and the test situation tend to constitute more controlled
and uniform stimulations for all students than do assignments and study
situations. As a corollary, answers to “essay questions” tend to be shorter and more
stereotyped than compositions on the same subject written for instructional
purposes. Hence, free-response techniques, as against product analysis, are
likely to produce the more comparable measures.

Substantial and important similarities exist between both these procedures
of measurement and those of observation. Products and free responses are
appraised directly with unaided senses and often they are evaluated quickly
and subjectively just as are the behaviors measured through observation.
Please notice that for all the examples in Figures 6 and 7 the scorer must read
or view the products himself, and that for Examples A and B the scorer enters
entirely subjective ratings for the products as a whole. In product analysis
and free-response techniques, the things measured are artifacts. They do stand
still and, thus, can be reappraised. This amenability to reappraisal makes the
two techniques potentially more reliable than observation, but bias and other
irrational factors still can influence the measures obtained from them, even
as they distort the results of observation.

Tenets of Good Observation Pertinent to Scoring Products and Free
Responses. To insure that direct and subjective evaluation does not destroy
the effectiveness of product analysis or free-response techniques, it is necessary
to follow the principles laid down for observation (pages 60-61). In
brief, they are:

1. To keep measurement and evaluation separate 

2. To use a known and defined criterion 

3. If possible, to use a tangible criterion. Such a criterion is illustrated 
in Example H. 

The effect of personal bias is to be avoided, either through self-control
or through compensation. In Example D, for instance, the homemaking
teacher may have had a preconception that plain but expertly sewed dresses
are better than fancy ones. She may, by concentration, attempt to hold this
bias in check when she appraises the dress or she may use a system of scoring
that compensates for its operation, perhaps weighting other factors more
heavily. The irrational influence of fatigue, wandering attention, shifting point
of view, etc., similarly may be minimized by systematic scoring and by
concentration on the task.




Products as Evidence or Having Self-Significance. Products and free
responses may themselves be the phenomena whose measurement is desired
(notice Examples A, D, and H) or they may be construed as evidence of
some ability or knowledge (Examples B, C, E, F, and G). The painting, the
dress, and the handwriting specimen have significance in themselves, and further
inferences based on them about pupils’ ability or skill may not be necessary.
On the other hand, cartoons about history, solutions to mathematics
problems, and answers to science questions usually have no significance in
themselves. They serve only to indicate knowledge of history, ability in
mathematics, or comprehension of science. When the products and free responses
are themselves the objects of measurement, maximum validity is to be expected.
But when they are measured merely as evidence of some pupil attribute,
their validity is lessened. This is true because in addition to errors in defining,
detecting, and measuring dimensions of the product there are likely to be
errors in relating the dimensions of the products to the dimensions of knowledge,
skill, etc., of which they constitute evidence.

Eliciting Free Responses 

Free-response tests are utilized because it is anticipated that free responses
to test items more nearly approximate a pupil’s natural actions than do his
responses to objective or guided response tests. Thus, a first step in designing
a free-response procedure is to decide upon a “natural” product for which
measurement is appropriate, either because the product itself is educationally
important or because it is demonstrative of something else (a knowledge,
attitude, etc.) that is important. A question or direction for free response is to
be designed, then, which will draw forth all or a portion of the selected “natural”
product or something that will approximate it.

To understand this relationship between free response and natural product,
please notice Example F in Figure 7. Why, we may ask, is it significant to give
a pupil an arithmetic problem about the dimensions of a parallelogram? The
rationale behind use of a thought problem as a testing device in mathematics
may run something like this. Pupils encounter mathematical situations in their
everyday life and respond to them with thought-out solutions or with written-out
solutions. The conglomerate of any pupil’s solutions is what we mean by
his “knowledge” of mathematics or his mathematical “ability,” and the average
competency of his solutions is the degree of his knowledge or ability. Moreover,
the quality of any given solution is considered to be similar to the quality
of all the others. By giving a pupil a mathematical problem on a test, we can
derive from him a “test product” that will be like one of the “natural products”
collectively comprising his mathematical ability.

To produce free responses that approximate natural products (in essential
respects, of course) a number of verbal and graphic devices are available.
These are the standard stimulations of free-response techniques. In addition to
their basic requirement for stimulating the desired type of response, it is
essential that they constitute the same stimulation for all pupils who react to them.
This means that they must be simply and clearly phrased, that they must avoid
ambiguous words and constructions, and that they not assume special experience
on the part of pupils if they are addressed to an unselected group of
pupils.

Directions to Write, Draw, Design, or Make Something. These are
exemplified by Examples G and H in Figure 7, and, along with questions, are
the most usual type of free-response item used for measuring educational
achievement.

Questions. Examples E and F in the same Figure 7 illustrate questions.
Such interrogatory free-response elicitors are the “essay” questions of long
educational use condemned by some advocates of “objective” measurement
for their “subjectivity.”

Open-End Statements.* In the attempted measurement of personality
factors (interests, fears, etc.) it often is desirable that a person be permitted
to write whatever he first associates with a given stimulus. To elicit such free
associations, open-end statements and stimulus words, objects, and pictures
are used extensively.

Examples of open-end statements are:

    My school is ______

    I hate to ______

    The mother and father ______


Stimulus Words and Objects.* These often are used for verbal response
in counseling interviews, in psychiatric examinations, and are an integral part
of lie detection procedures. Examples are:

Write down (or say) the first idea that occurs to you when you see (or hear)
each of the following words:

    Butter
    Medicine
    Woman
    Crime

* NOTE: The use of open-end statements and stimulus words, objects, and pictures
in personality measurement requires extensive special training. They should not be
devised, nor responses to them interpreted, by persons who lack this training, however
skillful they may be in measuring educational achievement. As often as not, the
responses elicited are verbal rather than written. Projective techniques are discussed at
greater length in Chapter 15.





Write whatever comes first into your mind when you see each of these objects:



Stimulus Pictures.* Pictures are another type of stimulus for free association
used in personality measurement. The best known are the Murray Pictures,
examples of meaningful stimuli, and the Rorschach Ink Blots, examples
of abstract or nonmeaningful stimuli. A sample of each is shown in Figure 8.

[Illustrations: A Murray Picture; A Rorschach Ink Blot]

Figure 8. Sample pictures from the Murray Thematic Apperception Test and the
Rorschach Ink Blot Test (5). (The former is reprinted by permission of the publishers
from Henry A. Murray, Thematic Apperception Test, Cambridge, Mass.: Harvard
University Press,




In addition to their use in personality measurement, pictures as stimuli to 
free responses frequently are used in intelligence tests. As a rule, such pictures 
suggest a given story or event, and a pupil’s response is judged to be adequate 
as he approximates the proper story or event. 

Stories to Complete. A little-used but potentially valuable free-response
elicitor is the incomplete story. This is found in some intelligence tests and 
most frequently in Language Arts instruction. For example, to gauge pupils’ 
skill at narration, a teacher in a tenth-grade English class writes on the black- 
board: 

“William Hunt planned to go on a bicycle hike with his gang one Saturday.
His older sister teased him when he went to bed very early Friday night so
that he would have lots of energy on Saturday. During the night he heard his
dog barking once in the back yard. When he awoke on Saturday, he dressed
for the hike and even before breakfast prepared himself several sandwiches
for a roadside lunch. He took these together with a poncho (in case of rain)
out to his bike to tie on the luggage carrier. But his bike was gone!”

“(Each of you finish this story just as you think it ought to be finished.)”

Problem Situations. The problem situation has been used more and more
extensively in recent years by Social Studies teachers to test pupils’ ability to
think critically. Mathematics teachers have been using the device for many
years whenever they have given a thought problem. In substance, a problem
situation is a statement of variables and the need for their proper resolution.
The individual is required to examine and relate the variables in an appropriate
way and present the right or a right solution. Example F in Figure 7 is one
example of a problem situation. Another is this item from a civics test
appropriate to the eleventh or twelfth grade.

“Some cereal companies claim that their cereals contain ‘anonoxin,’
which prevents tooth decay. Some dentists are reported to have said that
‘anonoxin’ does prevent tooth decay. It is known, of course, that sugar and
soft foods are factors in tooth decay, and that cereals are largely starch which
becomes sugar in the mouth, and that they need little chewing. You are a
teacher and a representative of one of these cereal companies wants you to
use a movie produced by his company. The film is on health and, among other
things, it says that ‘anonoxin’ prevents tooth decay. It does not advertise any
cereal but merely says at the end that ______ cereal company made the
picture.”

“What should you do and why?”

Requirements for Free-Response Elicitors. Any of these free-response
elicitors (questions, directions, stimulus words, etc.) must not in itself indicate
what response is expected. Differences in knowledge or intelligence or attitude
among pupils should be what makes for their differential responses, not their
ability to read. On the other hand, free-response elicitors should indicate
clearly the type of response desired and its limits as to time, space, or detail.



PRODUCT ANALYSIS AND FREE-RESPONSE PROCEDURES 


73 


Without these indications, pupil responses are not going to be comparable. 
To illustrate the desirable middle ground between items that contain clues 
to the right answer and those entirely too vague or unlimited, consider the 
following: 

Too many clues:  Describe the development of the “Bill of Rights,” stating its
                 first origins in England, its relation to other documents, and
                 how it became “attached” to the Constitution.

Too vague:       Discuss the Bill of Rights.

About right:     Describe the important events that made the “Bill of Rights”
                 part of our basic national law.


In preparing tests to contain free-response items, it is essential to leave
appropriate space for the responses. Pupils tend to gauge the extent of their
responses by the space afforded them. If a few phrases are all that are expected
in response, a half page of space is likely to make conscientious pupils strive
for additional statements and perhaps instigate repetition, padding, or even
error. On the other hand, if many sentences are needed for an adequate
response, it is frustrating to the pupil to have only two lines in which to reply.
In his frustration the pupil may abbreviate, cramp his writing, omit factors,
or doubt his interpretation of the question, all to the detriment of his response.

Scoring Products and Free Responses 

The assignment of proper measurement symbols (numbers or letters) to
products and free responses is, of course, the critical phase in these measuring
procedures. The symbols most frequently applied are classificatory in nature
or are indicative of rank. Scale measurement usually is not possible except
in research situations where special devices are used. In the example items
shown, the painting in Example A has been given directly a rank symbol while
the cartoon in Example B, the dress in Example D, and the drawing in
Example G have been assigned a letter indicating their classification or rating.
In Example H, the handwriting specimen has been assigned a classification
number that may be interpreted as a rank number. The numbers found for
Examples C, E, and F are still raw scores and may be converted into either
class or rank symbols.

With scores or marks restricted largely to the two least precise forms of
measurement, careful scoring of products and free responses is essential so
that further imprecision is held to a minimum. The different scoring methods
in general use include many that have such necessary precision, but,
unfortunately, some also are widely employed which lack it. As a rule, scoring
products or free responses as a whole is a less reliable approach than scoring them
piecemeal or factorially. The purpose of measurement and the time available
for it may permit or require use of over-all scoring, but, if they do not, some
sort of factorial scoring is thought to be essential.

OVER-ALL RATING

The scoring of the cartoon in Example B, Figure 6, illustrates the direct
assignment of a classification symbol to a product or a free response. The
scorer in this case simply viewed the cartoon, decided subjectively that it
belonged to a given class of cartoon, and assigned the letter A, which
symbolized the classification. It might have been any other letter or number, or
even a word, that denoted the classification. The important thing is that the
cartoon was considered to have characteristics similar to the classification.

Of all methods of scoring products and free responses, this is inherently
the most unreliable, albeit the most rapid. A good deal of its unreliability may
be allayed if the scorer has adequate directions, and if he follows the
directions in the same way for each product or free response he marks. The
directions (prepared by the scorer himself as a rule) should specify the aspects of
the product significant for measurement. They should represent the
classifications to be applied, and should contain examples of products that typify the
several classifications. (The Binet manual contains excellent examples of
directions for over-all rating of free responses (7).) Fatigue, boredom, and
deadlines are among the principal reasons for uneven reaction to a series of
products or free responses. Consequently, scoring should be attempted only
when, and for as long as, one can sustain interest in the task and remain
untired. To avoid having to mark the last of a series in a rush, the apparent
solution is to allot sufficient time in advance or, failing this, simply to miss the
deadline. Principles stated for the use of behavioral rating scales (page 59)
are applicable to the rating of products.

OVER-ALL RANKING 

In Example A, the pupil’s painting has been measured by comparing it 
with the water colors of other pupils, and then by assigning it an appropriate 
rank in this group of paintings. In art contests evaluations frequently are made
simply as a result of ranking the art work submitted. 

Where products or free responses are to be compared in one dimension
only, the method of over-all ranking can be highly reliable, far more so than
any over-all rating. The reason for this is obvious from the experience of any
of us. Consider, for example, the judgments we make about musical tones.
We can “rate” tones heard by calling them a, b, c, d, etc., a classification form
of measurement. But only trained musicians can do this with any degree of
accuracy. Or, we can say simply that this tone is higher, this is lower, and by
continuing our comparisons, establish a rank order of tones from lowest to
highest. This any of us but the tone-deaf can do and with greater accuracy
than we can classify the tones a, b, c, etc.

When several dimensions are to be the basis for comparison, reliability is
lessened and the method, at some undetermined number of dimensions,
becomes as unreliable as over-all rating. This may be seen by considering thirty
short compositions written by seventh-grade pupils. If these are to be ranked
simply on the basis of proper usage, ranking is easy and fairly precise. But if
they are to be ranked on usage plus penmanship, a hypothetical average of
some sort must be struck between the two for each paper, and the papers
compared in relation to this average. If to usage and penmanship is added
interest, even more vague averages must be estimated for the papers; and if
a fourth dimension is to be reckoned with, the teacher may with good reason
decide to give up trying to rank them.

To obtain maximum reliability in over-all ranking, it is necessary then to
assign rank for one dimension only, and, obviously, it is necessary that this
dimension be clearly defined. In addition, certain other procedures have been
found to add to the reliability of the method. First, all the products or free
responses to be ranked should be read or viewed before any can be ranked. If
there are many products in the group (say 15 or more, as a rule of thumb) it
is advisable to compare each with every other, or at least to make such
item-by-item comparisons among any subgroup greatly alike. Even more reliability
may be attained by a second determination of rank and a reconciliation of
differences between the first and second runs. In research, and certainly as the
basis for critical decisions, ranks assigned independently by several impartial
but equally qualified judges should be pooled to determine final rank.
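
The pooling of judges' ranks can be sketched in a few lines of Python. The text
does not say how the ranks are to be pooled; averaging each product's ranks and
re-ranking on the averages is only one common choice and is an assumption here,
as are the function name and the sample data.

```python
from statistics import mean

def pool_ranks(judge_ranks):
    """Pool ranks from several judges (assumed method: average, then re-rank).

    judge_ranks: list of dicts mapping product name -> rank (1 = best).
    Returns a dict mapping product name -> final rank. Products whose mean
    ranks are equal share the average of the positions they occupy.
    """
    products = list(judge_ranks[0])
    mean_rank = {p: mean(j[p] for j in judge_ranks) for p in products}
    ordered = sorted(products, key=lambda p: mean_rank[p])

    final = {}
    i = 0
    while i < len(ordered):
        # Extend j to cover every product tied with ordered[i].
        j = i
        while j + 1 < len(ordered) and mean_rank[ordered[j + 1]] == mean_rank[ordered[i]]:
            j += 1
        shared = ((i + 1) + (j + 1)) / 2  # average of tied positions
        for p in ordered[i:j + 1]:
            final[p] = shared
        i = j + 1
    return final

# Hypothetical ranks given independently by three judges to four paintings.
judges = [
    {"A": 1, "B": 2, "C": 3, "D": 4},
    {"A": 2, "B": 1, "C": 3, "D": 4},
    {"A": 1, "B": 3, "C": 2, "D": 4},
]
pooled = pool_ranks(judges)
```

Because the pooled figure is still only a rank, it says which product stands
higher, not by how much.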

It may be apparent by this time that over-all ranking (though inherently
a more reliable method of scoring) is likely to be far more time-consuming
than over-all rating. If done as quickly as over-all rating often is done, there
is no certitude that it will be any more reliable.

COMPARISON WITH A PRODUCT SCALE

We have just established that product-by-product comparisons may be
made with some accuracy, but we also have observed that the process is
time-consuming, and that it yields rank symbols only. A method of comparison more
rapid and capable of yielding classification symbols as well as rank numbers
should then be of great value. Such a method is the product scale.

The handwriting scales in use in elementary grades are the foremost
instances of this method of measurement (notice Example H in Figure 7).
Stereotyped handwritten passages have been developed that represent the
variety in pupil handwriting in a number of gradations, from the least legible
and attractive to the most. A pupil is asked to write the passage used in the
scale in his own hand. The teacher simply finds on the scale the specimen
most like the pupil’s, and the pupil’s paper is given the number or letter of this
specimen.

Except in handwriting, there are few published product scales. It is difficult
to standardize them, and in American schools most subjects lack the
standardization of content that might make them useful. The virtue of a published
scale is that its numbers may indicate nationally significant classifications or
rank in a norm group. In some scales an effort has been made to have specimens
typical of various grade or age levels. In others, equal differences in skill or
difficulty have purportedly been established among all the specimens contained
in the scale.

For subjects in which pupil products are likely to be of the same type for
a number of years, it is possible for a teacher to develop his own product
scales. To do this, you need to select from each group of products submitted
by pupils several that are representative of the worst, best, and intermediate
levels of attainment. When selections have been made from three or four such
batches, establish a rank order among all the selections, using fellow teachers
as judges in addition to yourself. Reject all specimens upon which there is
disagreement as to order. The remainder is your product scale. You may
assign numbers or letters to the scale specimens and use your scale as you
would a published scale. Do not, of course, attribute scale significance to
numbers assigned to specimens. If the scale seems to have too many units, select
5, or 7, or 10, or whatever number seems best but be sure that these products
are evenly spaced along the larger scale.
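
One way to thin a long scale to a chosen number of specimens is sketched
below. "Evenly spaced" is taken here to mean evenly spaced by rank position,
which is an assumption; the function name and the 21-specimen scale are
hypothetical.

```python
def pick_evenly_spaced(ranked_specimens, k):
    """Pick k specimens at evenly spaced rank positions, keeping both ends.

    ranked_specimens: list ordered from worst to best (or best to worst).
    k: number of specimens wanted, with 2 <= k <= len(ranked_specimens).
    """
    n = len(ranked_specimens)
    if k < 2 or k > n:
        raise ValueError("k must be between 2 and the number of specimens")
    # Integer positions from the first specimen (0) to the last (n - 1).
    indices = [(i * (n - 1)) // (k - 1) for i in range(k)]
    return [ranked_specimens[i] for i in indices]

# A hypothetical scale of 21 agreed-upon specimens, numbered by rank.
full_scale = list(range(1, 22))
short_scale = pick_evenly_spaced(full_scale, 5)
```

The numbers attached to the chosen specimens remain rank or class labels
only; equal spacing in rank does not make the differences between them equal.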

FACTOR RATING

If a product or free response has more than one measurable dimension (as
they generally do), and if all the dimensions are to be reckoned with in scoring
the product or free response, both over-all rating and over-all ranking have
serious limitations. As we have observed, you are forced to derive mentally
some average among the dimensions, and actually to rate or rank this average
(which is only an idea) rather than anything in the product itself. Because
this process is subjective and unsystematic, an indeterminate degree of error
is certain. To avoid this error, it is advisable to use some method of analytic
scoring.

This method is exemplified in Examples C, D, E, F, and G, in Figures 6
and 7. Notice that in each case several things are measured, not just the thing
as a whole. Factorial scoring consists of the identification of factors that
constitute the important properties of the product, the separate measurement of
each such dimension and, if necessary, the systematic combining of the
separate measures into a single one. In factor rating (the method of analytic
scoring to be discussed first) the dimensions are immediately assigned a
measurement symbol, usually a number, which represents a classification or a point
on some imaginary scale.

Advantage of Factorial Scoring. The advantage of dealing with
dimensions separately is that each measure then represents the status of a given
dimension and that only. In over-all scoring, on the other hand, the same
measure may represent different degrees of several dimensions and be
indicative only of their hypothetical average. To see this distinction, consider the
case of English compositions, two of which may have been given directly an
over-all score of 50. The 50 for one paper could represent a hypothetical 70
in grammar, 30 in style, and 50 in content, whereas the 50 for the other could
stand for a hypothetical 50 in grammar, a 70 in style, and a 30 in content.
With factorial scoring, the different numbers would first have been actually
assigned to the three dimensions and the different status of the different
dimensions of the compositions then characterized as different. If a single number
were needed to represent each composition, the separate measures of the three
dimensions would still remain on the papers to explain what the 50 really
meant.
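
The two compositions just described can be set side by side in a short Python
sketch. The dimension scores are the hypothetical ones given in the text; the
function names are illustrative only, and the plain average stands in for
whatever single score a teacher might derive.

```python
# Factor scores for the two compositions discussed in the text.
paper_one = {"grammar": 70, "style": 30, "content": 50}
paper_two = {"grammar": 50, "style": 70, "content": 30}

def overall(scores):
    """Single score taken as the plain average of the dimension scores."""
    return sum(scores.values()) / len(scores)

def differing_dimensions(a, b):
    """Return the dimensions on which two papers differ, with both scores."""
    return {d: (a[d], b[d]) for d in a if a[d] != b[d]}

same_overall = overall(paper_one) == overall(paper_two)
differences = differing_dimensions(paper_one, paper_two)
```

Both papers average 50, yet they differ on every dimension, which is the
information the single score conceals.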

It must be recognized, of course, that proper scoring of a product may
require attention to its over-all pattern or Gestalt as well as to its separable
dimensions. If so, this over-all aspect should be defined as a separate
dimension and measured as any elemental dimension. Moreover, just because there
may be such a holistic dimension is no reason to disregard factorial scoring.

Once the dimensions have been identified, it remains to rate them in
appropriate ways. The general principles and rules for the use of behavior rating
scales (page 59) are as germane to factor rating as they were to over-all
rating of products. Numbers or letters assigned to dimensions as their ratings
are to be viewed as classification symbols only. If a single rating is to be
derived from the dimension ratings, numbers must be assigned to the dimensions
rather than letters or words. The single score may be a total or an average.²
While precedent and administrative convenience may require such single
ratings, they obscure as much as they reveal and you are advised not to use them
for instructional purposes. The factor scores themselves tell the pupil his
strengths and weaknesses and show the teacher in detail what he has
accomplished and what he has not.

Example F in Figure 7 illustrates a special case of factor rating, the
“correct” or “incorrect” assigned to aspects of solutions to mathematics problems.
In this case, it is assumed that only one type of response is correct for each
phase and all others are incorrect. It remains for the scorer to look only for
one class of responses, correct ones. He need not discriminate among the
others. The distinction between a correct response and an incorrect one
presumably is clear cut and constant, and if this assumption is borne out, this
special type of factor rating can be highly reliable.

FACTOR COUNTING

A more reliable method of factorial scoring capitalizes on the precision of
enumeration. You know, of course, that we make fewer errors in counting
than we do in most of our perceptual activities. In fact, counting is about the
only thing man is still trusted to do in the physical sciences. His color
discrimination has been replaced by spectrography; his sense of temperature, by
the thermometer; his auditory discrimination, by electronic devices; and, of
course, his senses of distance and weight by calipers, meter sticks, and scales.
But he still enumerates many things in the laboratory and his counts are
admitted as scientific data (if he is properly trained, of course).

² It is considered mathematically invalid to apply addition and division to
classification numbers (see page 9). However, custom has long condoned and probably will
continue to condone the practice in the scoring of products or free-response test questions.

Advantage of Factor Counting. In addition to relative freedom from
perceptual errors, enumeration usually can avoid the influence of bias and
other subjective sources of error. It is relatively easy to say that this is a
“good” paper to a pupil whom you like, that this is a “bad” paper to a pupil
whom you dislike, and yet have the papers as judged by others be equivalent
in value. It is relatively difficult, on the other hand, for your prejudices to affect
your count of spelling errors in the two papers. Moreover, counting may be
continued with little loss in accuracy despite fatigue, boredom, or malaise,
while these same things can make qualitative appraisals entirely unreliable. In
fact, the stigma of subjectivity usually is lifted from behavioral measurement
when simple enumeration is involved.

Dimensions Must Have a Quantitative Aspect. Factor counting requires 
first that each dimension to be measured be defined in terms of things that 
can be counted. In Example C in Figure 6, good grammar was defined as the 
absence of three types of error: punctuation, spelling, and syntax, all of which 
errors can be noted and counted. Moreover, unless a dimension can be defined 
in terms of things subject to enumeration, it must be rejected if factor counting
is to be the only scoring method applied. 

For each product, we count the factors that belong to each dimension of
concern. These numbers may be left as such, as raw measures of the
dimensions, or they may be converted into rank or other derived numbers.³ If
necessary, they may be combined into a single score for the product just as were
factor ratings.

Applicability of Factor Counting. A great many of the educationally
significant dimensions of products and free responses are amenable to factor
counting. Some of these, along with illustrative items to be enumerated, are
shown in Table 1. Other dimensions (not so readily susceptible to a counting
approach) may, with ingenuity, be so defined. For hundreds of years literary
critics have insisted that a writer's style can be described only in qualitative
and figurative terms. Now, authors of formulas for judging the difficulty and
interest of books are asserting that many enumerative items collectively
constitute style, i.e., length of words, length of sentences, frequency of personal
referents, etc.
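
As an illustration of how such enumerative items might be counted, here is a
minimal Python sketch. It is not any published readability formula; the word
list of personal referents, the rules for splitting words and sentences, and
the function name are all assumptions made for the example.

```python
import re

# Assumed, deliberately short list of personal referents.
PERSONAL_REFERENTS = {"i", "me", "we", "us", "you", "he", "him",
                      "she", "her", "they", "them"}

def style_counts(text):
    """Count a few enumerable style factors in a non-empty passage."""
    # A sentence is taken to end at ., ! or ?; a word is a run of letters.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    return {
        "words": len(words),
        "sentences": len(sentences),
        "mean_word_length": sum(len(w) for w in words) / len(words),
        "mean_sentence_length": len(words) / len(sentences),
        "personal_referents": sum(w.lower() in PERSONAL_REFERENTS for w in words),
    }

sample = "We went home. He saw her."
counts = style_counts(sample)
```

Two scorers applying the same counting rules to the same passage should arrive
at the same figures, which is the source of the method's reliability.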

Factor counting, as a method of scoring products and free responses, has
such a great potential for increasing the reliability of these procedures that it
deserves the widest possible use. Both product and free response appraisal fell
into disrepute as means of measuring educational achievement because of their
unreliability. Their intrinsic validity has seldom been questioned.
Consequently, if their reliability can be increased to approximate that of the
“objective” test, an excellent educational tool has been reclaimed.

³ The conversion of raw scores into rank indexes or other derived numbers is
discussed on pages 155-160.

TABLE 1

Some Pupil Products Particularly Amenable to
Factor Count Scoring

Products                              Illustrative factors to be counted

English compositions or themes        Usage:  Spelling errors
                                              Punctuation errors
                                              Word form errors
                                              Sentence errors

                                      Style:  Different sentence form
                                                constructions
                                              Mean sentence length
                                              Figures of speech
                                              Clichés
                                              Repetitive adjectives

Reports and answers to free           Facts or accurate ideas
response questions in                 Correct statements of relationship
  History                             Correct generalizations
  Geography
  General Science
  Biology

FACTOR WEIGHTING

A frequent criticism of factorial scoring is that trivia may be given undue
significance and critical points may be slighted because all are added together
to produce the single number that is the product's or free response's score.
This condition may be avoided just by not deriving a total score. But if one
is needed, the relative significance of different factors may be retained by some
system of factor weighting.

Importance Weighting. Two methods of weighting are in common use.
One bases the weight of factors on their estimated importance and this method
is the one illustrated in Example h in Figure 7. In this example, as in all cases
of importance weighting, the different numerical values assigned are arbitrary
and the decision that one dimension is more important than another has little
empirical justification. If, of course, many teachers and authors in the field
have agreed on a relative order of importance, there is at least the authority
of consensus on which to depend. Again, should certain dimensions be
composites of a known number of lesser dimensions, there is some logical
justification for weighting the composite dimension with a number equal to its
components. However arbitrary and subjective importance weighting may be, if
done carefully, and frequently corrected, it helps to make single scores for
multidimensional items more valid.

Cook has derived from many sources certain criteria for the significance
of items of learning that may have a bearing on factor weighting. These are
“(a) the frequency with which it will be used, (b) the cruciality of the
situations in which it is needed, (c) the extent to which superior individuals use it,
(d) the universality of the need in different vocations, (e) the universality of
the need in different geographic areas, (f) the universality of the need in
different time periods, (g) the difficulty of learning, (h) the frequency of error, and
(i) the frequency of use in a specific vocation” (1).

Difficulty Weighting. The other method of weighting is a function of the
difficulty of the dimension. Difficulty being applicable only to tests of skill,
knowledge, or intelligence, this method is usable only in these areas. The
relative difficulty of factors may be based on opinion, as it is with importance, or
it may be based on the performance of pupils with respect to the items. The
latter procedure is the more objective and systematic of the two and, hence,
the preferable one. It is explained by means of the following example.

In a biology class, the teacher recurrently tests pupils by asking them to
describe an animal typical of the life form they are studying at the moment.
Over a period of years he has classified and tabulated the statements they
include in their answers and determined for each classification the percentage
of statements that are correct. His cumulative tabulation resembles this:⁴

                                        Per cent of pupils who
                                        answer correctly

Classification
Essential physical characteristics              70
Significance in human affairs                   50
Method of reproduction                          45
Type of food, hosts, or prey                    [?]5

On the basis of these percentages, the teacher gives different weights to
correct statements according to the factor they represent. The weights are:

Classification                                   1
Physical characteristics                         1
Significance in human affairs                    2
Method of reproduction                           2
Type of food, host, or prey                      3

⁴ The percentages are illustrative only, and do not indicate the actual relative difficulty
of these factors.

Any pupil’s total score on the test then becomes the sum of the correct
statements multiplied by their proper weights. A pupil’s paper might produce
a score of 17 by stating 3 correct items of classification, 5 of physical
characteristics, 2 about the animal’s social significance, 1 on reproduction, and 1 on
food, thus:


3 × 1 =  3
5 × 1 =  5
2 × 2 =  4
1 × 2 =  2
1 × 3 =  3
        --
        17
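
The same computation can be written as a short Python sketch. The weights and
counts are those of the biology example above; the dictionary keys and the
function name are illustrative only.

```python
# Weights the teacher assigned to each factor in the biology example.
WEIGHTS = {
    "classification": 1,
    "physical characteristics": 1,
    "significance in human affairs": 2,
    "method of reproduction": 2,
    "food, host, or prey": 3,
}

def weighted_score(correct_counts, weights=WEIGHTS):
    """Sum of (number of correct statements x factor weight) over all factors."""
    return sum(weights[factor] * n for factor, n in correct_counts.items())

# Correct statements counted on the pupil's paper described in the text.
pupil = {
    "classification": 3,
    "physical characteristics": 5,
    "significance in human affairs": 2,
    "method of reproduction": 1,
    "food, host, or prey": 1,
}
total = weighted_score(pupil)
```

Keeping the per-factor counts alongside the total preserves the detail that a
single weighted score would otherwise hide.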

General Significance of Weighting. The principle of weighting is as
applicable to a test as a whole as it is to the answer to any free-response item.
Time or length of response as well as importance and difficulty may be used to
determine proper inter-item weight.

ADDITIONAL PRINCIPLES FOR SCORING PRODUCTS AND FREE RESPONSES

From the descriptions of over-all rating and ranking and of factor rating
and counting as methods of scoring, you may have inferred that each method
was to be used separately and exclusively. To the contrary, effective scoring may
require a combination of methods since products and free-response tests are
not apt to fit any given stereotype of scoring. Hence, it is advisable to use a
scoring system that is best adapted to your purpose in measurement and to
the types of products or free responses you intend to appraise. Combine
factor counting and factor rating procedures as you need to handle different types
of dimension. If some free responses are unidimensional, while others are
multidimensional, use over-all rating as well as factor rating. The ultimate
criterion for product and free response scoring is that measurement symbols
be assigned to them that characterize their status most precisely, not that a
given method of scoring be used.

The system of scoring devised for a given free-response test or product is
the scoring “key.” It will not permit automatic scoring as will the key to a
guided response test, but it will insure maximum reliability in ratings, rankings,
and counts. The key should contain as a minimum the numbers or other
symbols to be assigned, the dimensions to which they are to be assigned,
dimension or item weights, the significance of varying numbers or letters, and a
formula for a total score, if one is to be derived. Where products or free
responses relate to an area of knowledge, the elements of knowledge
represented constitute the most important dimension of the products or free
response. Consequently, the facts or relationships that would comprise an ideal
product or response should be written down. They will serve as a reference
point for rating answers or as reminders of correct factors if a counting
procedure is used.

In scoring free-response tests, it is advisable to score all papers on one
question, then all papers on the second one and so on, rather than score one 
pupil’s paper in entirety before scoring the second paper. By this means, atten- 
tion may be kept on a single focus and greater speed is possible. Moreover, 
each pupil’s response to any question is more likely to receive equitable 
scoring. 

Applicability of Product Analysis and Free-Response Procedures 

Any phenomenon of educational significance written, drawn, or made by
a pupil, or closely related to something so written, drawn, or made, may be
measured through product analysis and free-response tests. The procedures
then are widely applicable and are used in all school grades and subjects.

Product analysis is the primary means of measurement in shop instruction,
in homemaking and art, and is of great importance in language arts and social
studies. Free-response procedures are employed extensively in social studies,
language arts, mathematics, and science. In addition, both products and free
responses are used by home room teachers, counselors, and school
psychologists to appraise personality variables. Pupil “autobiographies” and paintings
are the two products most widely used for this purpose.

The two procedures are recommended where they are applicable because
of their inherently high validity. As we have seen, their reputed unreliability
is as much a function of the way they are used as of the techniques per se.
Properly used, they can be highly reliable. A free-response procedure should
be selected in preference to product analysis if strictly comparable measures
are necessary for a group of pupils.

Summary 

What pupils produce in school (compositions, drawings, notebooks, tie
racks, and dresses) often are employed in educational measurement as well as
are the free responses they make to certain test questions. Such instructional
or test products are appraised directly with unaided senses and often are
evaluated quickly and subjectively. Thus, product analysis and free-response
tests are obedient to much the same principles as observation.

Free responses in tests are elicited by directions to do something, by
questions, by open-end statements, and by stimulus words and objects, stimulus
pictures, stories to complete, and problem situations. Both products and free
responses are susceptible to various methods of scoring. In reverse order of
reliability these methods are over-all rating, over-all ranking, comparison with
a product scale, factor rating, and factor counting. The last two methods
involve the identification of the dimensions that comprise any product or test
response and the separate rating of each or the separate enumeration of the
components of each. In factorial scoring, weights of different size may need
to be assigned to dimensions or the elements thereof that are of varying
importance or difficulty.

In practice, the scoring of products and free responses may require a
combination of methods. The system of scoring devised for any product or free
response item is its “key” and should be rigorously followed in scoring. In
scoring free-response tests, it is advisable to mark all papers on one item, then
all papers on a second item, and so on, rather than to score all of one paper
before scoring the second paper.


EXERCISES

1. List the pupil products that are of importance to the subject or grade in
which you specialize, and state for each several dimensions that should be
evaluated.

2. Prepare a free-response test of at least five items for the subject or grade
in which you specialize. Devise a “key” for scoring responses to each question.

3. Obtain sample pupil products for the grade or subject in which you
specialize and score them first by over-all rating and second, by factor rating
or counting.

4. Put these same products in order of excellence from best to worst. Select
at least five that could constitute a rough product scale.



CHAPTER 6 


GUIDED RESPONSE PROCEDURES 


In the preceding chapter we discussed the use of free responses to test
questions as means of measuring behavioral phenomena. We found that the
scoring of such answers tended to be unreliable, and we presented various
techniques for making them less so. Now, we shall turn to the body of
procedures that have been developed in an effort to free educational measurement
from the unreliability bug-a-boo: the true-false questions, multiple-choice
items, matching questions, short answer, and fill-in items, and all the other
guided response items that constitute objective tests.²

In our treatment, we first shall illustrate the many possible types of items
and discuss briefly their basic attributes. Second, we shall examine the forms
of measurement that we can hope to derive from guided response tests. After
this we shall describe the construction of guided response instruments and
their administration. Finally, the chapter will conclude with a short discussion
of “standardized tests,” this being the term ascribed to published guided
response instruments for which norms are available.

¹ “Guided response procedures” is used as a categorical term for true-false,
multiple-choice tests, and the like, in preference to the more customary
“objective tests” for the following reasons. The term “objective tests” has no
official or even semiofficial standing as a technical phrase. It does not derive
from any systematic classification of behavioral measurement procedures. The
word “objective” can refer only to a single aspect of testing, that of scoring,
while the construction and administration of such tests is as subjective a
process as the construction and administration of any tests. Moreover, the
phrase “objective tests” connotes (to the authors, at any rate) that the
procedures are free from human error, which, of course, they are not.

On the other hand, the term “guided response procedures” does derive from a
systematic effort to classify the many activities and artifacts of behavioral
measurement. It refers to what is thought to be the most salient characteristic
of this approach to measurement, namely, restricted subject responses. And,
finally, “guided response procedures” should not invoke any stereotyped
connotation of infallibility.

² For a discussion of the history of such tests, see C. C. Ross and J. C. Stanley,
Measurement in Today’s Schools, New York: Prentice-Hall, Inc., 1954, chap. 2.




TYPES AND ATTRIBUTES OF ITEMS 

The variety of test items used to elicit guided responses from pupils is very
great.³ In this variety, though, it is possible to detect three basic categories of
expected response. In many cases, pupils must select an answer from among
several possible ones; in others, pupils must provide the answer themselves;
and in still others, they are required to arrange words or objects into a proper
array. We shall illustrate each of these three categories of items, and in so
doing, describe the types of items most prevalent in guided response tests. It
must be recognized that many others exist or can be devised and that
combinations are possible and often desirable. Moreover, it should not be assumed
that the precise form of any example is the only form for that type of item.
Numerous variations are and can be practiced for each.

Selection of an Answer. Probably the greatest number of guided response
tests employ questions for which pupils are expected to select the right answer
from among those given. The appearance, the number of options, and the
means of selection vary, but the principle of selection holds alike for
true-false, multiple-choice, matching items, and all their variants. These are
illustrated in Figure 9.

Example 2 in Figure 9 is similar to the other three in appearance, but it
differs from them in its significance. The item is illustrative of the
select-an-answer items used in personality and attitudinal measurement to gauge
feelings.

Provision of an Answer. The items illustrated in Figure 10 require that
a pupil provide his own answers, not just select one. Answers are limited to a
word or phrase but, unlike the select-an-answer items, interpretation does
play a small part in scoring. Notice Example 2 in Figure 10. In answer to the
question, “What makes a cake rise?” pupils might respond with CO2, carbon
dioxide, gas, heat, baking powder, or soda. Are all these to be marked correct?

Because interpretation is involved (though far less than in free-response
items) and because they may not be scored automatically, items of this type
are not used in most standardized group tests. They are a mainstay of
standardized individually administered tests, however, and they are possibly as
widely used by teachers in their self-devised instruments as are the
select-an-answer type.

Arrangement of Elements. The third basic category of guided response
items, illustrated in Figure 11, requires neither selection nor provision of an
answer, but rather the arrangement of elements into a pattern. Of the three
categories, these are the least used. Although scoring may be done with an
exact key (no interpretation needed), the format of answers often precludes
automatic scoring. Moreover, the items obviously are applicable only to
subjects in which arrangement has particular significance.

³ A recent summary of research in achievement testing (9) states that there are no
less than fifty different objective test techniques in common use.


1. True-false

Circle T or F.

T  F  In a city manager form of government, the mayor has little power.

2. Agree-Neutral

Circle Yes, No, or U (Undecided).

Yes  No  U  Minority groups should be more aggressive when faced with
            restrictive real estate covenants.

3. Multiple Choice

Select the option that makes the statement correct.

Rainfall in Nevada is very light because
   a. Nevada has no large bodies of water.
   b. The prevailing winds are northerly.
   c. The Sierra Nevada Mountains screen the state from moisture-laden air.
   d. Desert areas have a moisture evaporation rate too low for clouds to form.

4. Matching

Select the inventor from the right-hand column who goes with each thing in the
left-hand column.

   ___ 1. Cotton gin                a. Marconi
   ___ 2. Electric starter          b. Colt
   ___ 3. Steam engine              c. Howe
   ___ 4. Wireless telegraphy       d. Edison
   ___ 5. Atlantic cable            e. Kettering
   ___ 6. Sewing machine            f. Field
                                    g. Franklin
                                    h. Whitney
                                    i. Watt

Figure 9. Examples of guided response test items which require selection of an answer.


Devices Employed by Guided Response Items

As you may surmise from the test items illustrated in Figures 9, 10, and
11, guided response procedures exploit all the basic devices by which
measurement symbols may be assigned to behavioral phenomena. Standard stimuli
are present in the questions themselves; it is assumed that all pupils perceive
the same thing when they read the same item. In general, this assumption is
warranted except for poor readers and foreign-speaking or culturally atypical
pupils. The limited number of possible answers, each with predetermined
significance, makes for standard responses to the items. Guided response
questions may be scored with a key, and thus they embody standard analysis in
perhaps its purest form. Finally, as we will see later when we describe the
construction of guided response tests, the test items sample that which they
purport to measure. They do not, as a rule, survey the whole of it.


1. Fill-in or Completion

Write the correct word or number in each space.

The usual automobile wet cell battery contains ______ acid, and has a
potential of ______ volts per cell.

2. Short Answer

Identify the following with a word or phrase.

   The Duke and the Dauphin ______
   Brom Bones ______
   The Deacon's Masterpiece ______
   Hiawatha ______

Answer the following with a number, word, or phrase.

   What makes a cake rise? ______
   What oven temperature is called moderate? ______
   What quality do egg whites give to baked dishes? ______

3. Labeling

Label the parts of a living cell as shown in this diagram.

[Diagram not reproduced.]

Figure 10. Examples of guided response items which require provision of an answer.

Advantages and Disadvantages of Guided Response Items 

Guided response tests employ standard stimuli in common with free-response
tests. They utilize standard analysis and must give attention to sampling
along with all the basic procedures of behavioral measurement. Their
unique attribute is that of standard responses and this gives the procedures
their primary characteristics.

By having options of response and their meaning predetermined, it is
possible to use a scoring system that involves little or no interpretation by the
scorer and thus the subjective element in scoring can be eliminated or at least
made negligible. It is possible, furthermore, to make comparisons among
pupils on the basis of test scores since the test behavior of pupils is limited to
a few options, each having the same meaning for all pupils. On the
disadvantage side, limiting test behavior to a “yes” or “no,” to a choice of a, b, c, or
d, etc., automatically limits the pupil variation that the test can measure to
the amount and kind of variation expressed in the options. Thus, where
variation among pupils in a given subject is known to be greater than this, a guided
response test may have limited validity. Moreover, standardized responses are
best adapted to questions with exact and limited answers. In consequence,
guided response tests can handle well the aspects of any subject that are exact,
as are the names and dates of history; but they are far less adequate for the
indeterminate aspects, such as the causes, trends, and principles of history.

The items we have illustrated have been designed for written response.
However, guided response procedures are as applicable to oral testing as they
are to written testing. Many of the most valid intelligence and personality
tests involve oral as well as written directions, and oral as well as written
responses. In oral testing, the three types of items described are all to be found,
and the four devices of standard stimulations, responses, analysis, and
sampling are utilized just as they are in paper-and-pencil tests.

Forms of Measurement to Be Derived from 
Guided Response Test Scores 

Raw Scores as Classification Numbers. The number of guided response
items that a pupil answers correctly or in some given way is called a raw
score.⁴ This raw score, 10, 35, 72, 113, or anything else, is essentially a
classification number with respect to the test. It says simply that on this test this
pupil answered this number of items correctly. It is not a very precise
classification number because two pupils might have the same raw score and yet
have answered different questions correctly.

The fact that the raw scores on guided response tests are classification
numbers is consistent with the function of the separate test items. Each item
is designed to separate pupils into distinct categories. In tests of skill,
knowledge, and intelligence the categories usually are two, those who answer the
item correctly and those who cannot answer it correctly. In tests of personality
and interest, items may yield three- or fourfold classifications but the function
of the item still is only to classify. Any pupil’s raw score then represents the
number of times he is classified in a given way by the items of the test.

⁴ In some cases it may be the number incorrect, but the significance is the same.
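
The idea that a raw score is a count of item-by-item classifications, and that two pupils can share a raw score while answering different items correctly, may be easier to see in a small sketch. The answer key, the pupils, and their responses below are all hypothetical and are not taken from the text.

```python
# Hypothetical five-item answer key; none of these values come from the text.
KEY = {1: "T", 2: "c", 3: "F", 4: "a", 5: "b"}


def items_correct(responses, key=KEY):
    """Return the set of item numbers on which the response matches the key."""
    return {item for item, correct in key.items() if responses.get(item) == correct}


def raw_score(responses, key=KEY):
    """The raw score is simply the number of items classified as correct."""
    return len(items_correct(responses, key))


# Two hypothetical pupils with different patterns of right and wrong answers.
pupil_a = {1: "T", 2: "c", 3: "T", 4: "a", 5: "d"}
pupil_b = {1: "F", 2: "a", 3: "F", 4: "a", 5: "b"}

for name, responses in (("A", pupil_a), ("B", pupil_b)):
    print(name, raw_score(responses), sorted(items_correct(responses)))
```

Both pupils receive the same raw score here although they are right on different items, which is why the raw score is described as a rather imprecise classification number.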





The Assumption of Equality Among Items. When a raw score is ob- 
tained in this manner by adding the correct or any other given category of item 
responses, the identity of the item responses is lost. Because of this, certain 
assumptions about the items must be made if the raw scores of different pupils 
are to be compared fairly. It must be assumed that all items have approxi- 
mately the same difficulty or that their difficulty increases by even increments, 
that they have equivalent significance for the subject, and that all items have 
equal validity. To the degree that these assumptions are unwarranted, the same 
numerical scores may indicate different status and different scores might indi- 
cate equivalent status. 

Conversion of Raw Scores into Rank and Scale Numbers. The classification
numbers that are the raw scores may be converted into rank symbols,
either plain rank or percentiles. If the test has had the validity, importance,
and difficulty of its items established, the scores may under certain
circumstances also be changed into scale numbers. The rank conversions may safely
be made for raw scores obtained from the tests designed by teachers
themselves, since rank differences imply no given differences in size or quality.
Scale numbers, on the other hand, do imply such a condition and unless
teacher-designed test items have been analyzed for difficulty, importance, and
validity, raw scores may not be converted into scale scores. Statistical
procedures for converting raw scores into rank and scale numbers are described
in Chapter 7, pages 155-158.
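
As a rough illustration of the rank conversions mentioned here, the sketch below turns a list of raw scores into plain ranks and percentile ranks. The procedures of Chapter 7 are not reproduced in this section, so the definitions used are assumptions: tied scores share the average of their rank positions, and a percentile rank is taken as the percent of scores below a given score plus half the percent equal to it. The raw scores are hypothetical.

```python
def plain_ranks(scores):
    """Rank 1 is the highest score; tied scores share the mean of their positions."""
    ordered = sorted(scores, reverse=True)
    rank_of = {}
    for s in set(scores):
        positions = [i + 1 for i, v in enumerate(ordered) if v == s]
        rank_of[s] = sum(positions) / len(positions)
    return [rank_of[s] for s in scores]


def percentile_ranks(scores):
    """Percent of scores below each score plus half the percent equal to it
    (an assumed, commonly used definition)."""
    n = len(scores)
    result = []
    for s in scores:
        below = sum(1 for v in scores if v < s)
        equal = sum(1 for v in scores if v == s)
        result.append(100.0 * (below + 0.5 * equal) / n)
    return result


raw = [10, 41, 72, 113, 72]  # hypothetical raw scores for five pupils
print(plain_ranks(raw))
print(percentile_ranks(raw))
```

Note that neither conversion says anything about how far apart two pupils are in achievement; it only orders them, which is why it is safe for teacher-made tests.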


THE CONSTRUCTION OF GUIDED RESPONSE TESTS 

A guided response test is any collection of guided response items which,
as a group, purports to measure something: knowledge of Spanish grammar,
skill in copyreading, attitudes toward school, intelligence, personality
structure, remembrance of the fifth chapter in Civics, anything. The test may be
short or long, homogeneous as to type of item or heterogeneous, written or
spoken, cover a week's study or a year's, and have a diagnostic or a survey
purpose.

If the test is to be as valid, reliable, and efficient as possible in the light
of what it measures and the purpose for measuring it, its construction is a
complex and time-consuming task. Once a guided response test has been
constructed, however, it may be used and reused with far greater economy of
time than any other educational measuring procedure. Moreover, a great deal
is known about guided response tests and one made in accord with this
knowledge is likely to be an effective measuring instrument.

Some of this knowledge has come from systematic research,⁵ some from the
reported experience of the users and designers of standardized tests, and still
more from the practices of countless teachers. The directions for the
construction of guided response instruments, which follow, are derived from all
these sources and as well from the experience and reflection of the writers.⁶

⁵ See Cook (9) for summaries of research on the construction of tests.

Definition of Phenomenon and Dimensions 

Define exactly the phenomenon to be measured and its measurable dimen- 
sions in behavioral terms. The essential first step in preparing a guided re- 
sponse test, or a free response test for that matter, is to define exactly what 
you wish to measure and specify its dimensions in behavioral terms. This is in 
keeping with the conditions of measurability discussed in Chapter 2, which 
any phenomenon must approximate if it is to be measured. 

To illustrate this first step in test construction, consider a teacher who
wishes to appraise his pupils’ knowledge of geography. He may begin by
stating that “knowledge of geography” means “what pupils can remember,
reason, and do that relates to the body of scientific knowledge about the earth,
its climate, and the significance of each.” Notice that the word “knowledge”
now has been “spelled out” with words less abstract, that represent types of
pupil behavior. Notice furthermore that “geography” has been reduced to
elements more tangible and less general, and that the phrase, “body of
scientific knowledge,” defines the ultimate model of the content of any student’s
knowledge.

This matter of a “model” is particularly important for test construction.
Obviously, if a pupil’s understanding or knowledge of a subject is to be
measured by a guided response instrument, there must be definite information
about the possible content of the understanding or knowledge. Only this will
enable the construction of test items that will properly sample the extent and
depth of any pupil’s knowledge. It is usual in testing for knowledge of
subjects to assume that any pupil’s knowledge approximates the body of organized
knowledge that is the subject, however slight is the approximation and
however erroneous it may be.

Another source of information about the content of pupils' understanding
is the free writing and discussion of pupils. These may be collected and
recorded over a period of time and from them may be outlined a “model” of
knowledge or understanding. This procedure is applicable when the object of
understanding or knowledge is not a defined and organized subject which is
taught as such. Because, however, most school instruction relates to defined
and organized subjects, this second method of determining the possible
content of pupils’ knowledge is little used by teachers.

The next step for the teacher is to have at hand or to obtain the facts and
generalizations that constitute the body of scientific knowledge of the earth
and climate which, by definition, must be the basis for the content of the test.
Since the teacher has elected to measure “knowledge of geography” without
any qualifications, the body of facts and generalizations on which he bases his
test must be very comprehensive.

⁶ Detailed procedures for test validation and standardization are beyond the scope
of this text. These are described in such books as Travers, Educational Measurement
(37), and Bean, Construction of Educational and Personnel Tests (2). Techniques for
finding an index of reliability and a standard error of score are presented in this text
on pages 168, 181-186.

The ideal source of such facts and generalizations is all that has been 
published in scientific journals and books about earth and climate. The prac- 
tical source for the teacher is a representative sample of these publications. 
This may include several geography textbooks and recent pertinent articles 
in geography journals. He may use a syllabus or course of study. It may simply 
be appropriate chapters in the textbook the pupils read, and/or his lecture- 
discussion notes for the course. As you may surmise, these sources are likely 
to provide progressively less adequate samples. 

The use of a single pupil text or instructor’s notes as the only source of
content for a test of knowledge is somewhat justified if the test has only to do
with understanding the text or the lectures and discussions. Examples of tests
with such limited purposes are those of chapter or unit, or any designed to
see what pupils have learned during a given instructional interval. However,
even where the phenomenon to be measured is thus limited, there are serious
deficiencies in the use of a single source. If the purpose is to see what pupils
have learned, it is easily demonstrable that they always have learned more
things, and things different from those the teacher and the text have presented.
If the purpose is to see what has been gained from a chapter, restricting test
questions to the content of the single chapter prevents measuring any
applications pupils may have made of ideas in the chapter and any ideas having a dual
source; for example, those that come in part from page 72 in the text and in
part from an instructional film.

The problem of determining the factual content on which a test is to be
based is peculiar to the measurement of subject achievement or aptitude, but
it has a parallel in the measurement of attitudes, interests, and other attributes
of personality. Just as it is necessary in measuring knowledge to have a definite
idea of what the pupils may know, so in the measurement of personality
attributes by guided response tests is it necessary to have beforehand some
model of the probable feelings and behaviors of persons possessing varying
degrees of the attribute in question. Often the principal sources of the model
are the actions and feelings of those who:

1. Possess the attribute to an extreme degree and

2. Possess none or a minimum of it.

In the design of vocational interest tests, the successful practitioners of
given vocations often are the first group and those who have failed in them and
those in antithetical callings are the second.

Now that the teacher has clearly defined knowledge of geography in
behavioral terms and has specified its possible content, it remains for him to
determine the dimensions he intends to measure. Some or most of the
dimensions are suggested by the content itself. The facts, generalizations, ideas of
relationships, nomenclature, and the like that constitute the science of
geography are the properties to some degree of any pupil’s knowledge. Other
significant dimensions are to be seen in the pupils’ actions in relation to such
content: the applications they make of it, the abstraction level(s) at which
they operate, and the errors in their concepts.

Accordingly, the teacher in our illustration decides that he will measure
each pupil’s knowledge of geography with respect to:

1. What and how many geographic terms he can identify (valley, cumulus,
etc.) in each basic area of geography (sun-earth relations, maps, land
features, winds, etc.).

2. What and how many basic facts he can state (maximum declination of
the sun at the equator is 23½°, rocks in mountain streams are rounded, etc.)
in each basic area.

3. What and how many generalizations he can make about causation,
relationship, significance, etc., in each basic area (e.g., the energy of storms
comes from the sun’s heat).

4. What geographic skills he can perform (read and make maps,
determine direction, etc.).

5. What applications he makes of his knowledge (e.g., says that Nevada
probably never will have a very large population).

6. His errors in any of the above.

The construction of guided response instruments to measure achievement
in other subjects or to measure personality, intelligence, etc., may require the
designation of entirely different dimensions. The nature of these is discussed
in Section II when the specific school applications of measurement and
evaluation are discussed. However, it needs to be re-emphasized that one or more
dimensions must be designated for the phenomenon before a guided response
instrument may be constructed to measure it⁷ and that these dimensions must
approximate certain conditions of measurability. As you may recall, the
critical conditions for educational measurement are that the dimensions shall:

1. Provide sensory data

2. Be clearly defined

3. Produce consensus among impartial and unrelated observers

⁷ A case for an exception to this rule might be made where so-called cut-and-try
procedures of test construction are used. The gist of these procedures is to establish
criterion groups and then to find test questions that discriminate between them most
sharply without any particular concern over what dimensions the questions appraise. For
example, if you wished to measure academic aptitude, you could select one hundred pupils
whom teachers rated as excellent students and another one hundred whom teachers
rated as very poor students, devise numerous test questions on any basis and on any
subject, administer them to the two groups, and select for your test of academic aptitude
the questions which a high percentage of the excellent students and a low percentage of
the poor students answered correctly. In this procedure, however, the originally devised
questions necessarily reflect some tacit and perhaps unconscious hypotheses about the
dimensions of academic aptitude. They certainly were not drawn at random from a
barrel labeled “Test Questions, Miscellaneous, Unsorted.” Consequently, we propose
that dimensions be explicitly designated prior to test construction even if empirical
findings on item discrimination are to be the basis for test revision. It is thought that the
great majority of test theorists agree with this position. See, in particular, Flanagan (14),
and Travers (37).

In addition, if pupil achievement is to be evaluated as a result of a guided
response test, dimensions should be selected in view of the standard on which
the evaluation is to be based. More often than not, a guided response
instrument is used not merely to measure a pupil’s status but also to evaluate it.
Should a performance scale be the evaluative standard, it is essential that test
items be keyed to each of the several levels of the standard. For example, if
the standard contains a level of evaluation and interpretation of facts as
distinct from simple knowledge of them, the dimension evaluation and
interpretation must be measured by the test as well as the several dimensions of
factual knowledge. (See Chapter 9 for a full discussion of evaluative
standards and their implications for measurement.)
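
The cut-and-try procedure described in the footnote on criterion groups can be sketched in a few lines: compute, for each trial question, the proportion answering correctly in the high and the low criterion group, and keep the questions with a large difference. This is only an illustration under stated assumptions. The groups are cut to four pupils each for brevity (the footnote speaks of one hundred), the responses are invented, and the cutoff `min_gap` is an arbitrary value chosen for the example, not one given in the text.

```python
def proportion_correct(group, item):
    """Share of pupils in a criterion group who answered the item correctly."""
    return sum(1 for pupil in group if pupil[item]) / len(group)


def select_items(high_group, low_group, items, min_gap=0.4):
    """Keep items answered correctly by a much larger share of the high group.

    min_gap is a hypothetical cutoff for the difference in proportions.
    """
    kept = []
    for item in items:
        gap = proportion_correct(high_group, item) - proportion_correct(low_group, item)
        if gap >= min_gap:
            kept.append((item, round(gap, 2)))
    return kept


# True means the pupil answered the question correctly (invented data).
high = [
    {"q1": True, "q2": True, "q3": True},
    {"q1": True, "q2": False, "q3": True},
    {"q1": True, "q2": True, "q3": True},
    {"q1": True, "q2": True, "q3": False},
]
low = [
    {"q1": True, "q2": False, "q3": False},
    {"q1": True, "q2": False, "q3": True},
    {"q1": False, "q2": False, "q3": False},
    {"q1": True, "q2": True, "q3": False},
]

print(select_items(high, low, ["q1", "q2", "q3"]))
```

In this invented data, question q1 is answered correctly by nearly everyone in both groups and so is dropped, while q2 and q3 separate the groups and are kept. The sketch says nothing about which dimensions the retained questions appraise, which is the reservation the footnote raises.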

Preparation of Items 

Devise verbal and/or graphic items, the responses to which automatically
will classify pupils with respect to given dimensions. As we have seen, a single
guided response item is capable only of classification and, in most instances,
twofold classification. Pupils usually will show many degrees of difference for
any measurable dimension and total test scores may be considered to reflect
this “continuous” variation. However, so far as item-by-item responses are
concerned, the continuous variation has been broken down into a series of
all-or-none variations.⁸ For example, take the first dimension that the geography
teacher in our illustration wishes to measure, “What and how many geographic
terms a pupil can identify.” Pupils may be expected to vary from those who
know practically none of the terms through those who know intermediate
amounts to the few who know a great number. No two pupils, in actuality,
are apt to know the same ones or the same amount. But when the pupils
respond to a question keyed to this dimension (e.g., cumulus means a glacier, a
fossil, a cloud, or a rock formation), they will be separated into two groups,
those who answer “cloud” and those who give any other answer, “fossil,”
“glacier,” “rock formation,” or no answer at all.

The rationale that permits us to obtain a measurement of a many-degreed
or “continuous” dimension through a series of twofold classifications is made
graphic in Figure 12. In these illustrations, notice that many degrees of
variation are possible relative to the lines on the wall (representing height) or to
the grid (representing knowledge of geographic names) but that only two
degrees of variation are possible for each line or each section of the grid. A
number that represents the height of each of the six boys is obtained by
counting all the inch lines touched by each boy. Similarly, a number representing the
geographic knowledge of each pupil may be found by counting all the name
squares that each pupil covers.⁹ For this process to be valid, it is necessary to
assume that height and knowledge of geographic names may be defined as a
group of elements¹⁰ each of which may be possessed or not possessed and all
of which have equivalent importance.

⁸ In the measurement of personality traits and structure and, in some cases,
achievement and intelligence, item responses may be used for tripart or even four-part
classification.

⁹ The rationale underlying tests employing items that make more than twofold
classifications is simply an extension of the rationale for twofold items.


[Figure 12 consists of two drawings, not reproduced here: six boys standing
against a wall marked with inch lines, and a grid of 36 numbered sections headed
“Knowledge of Geographic Names.” The text printed with each drawing follows.]

Boy A is 5'2", B is 5'4", C is 5'6", etc. Their height can be computed by
making a yes or no classification for each with respect to each line on the
wall; thus A is yes for 5'1" and 5'2", but no for the others; B is yes for
5'1", 5'2", 5'3", and 5'4", but no for the others; C is yes for 5'1" through
5'6", but no for the balance; etc., until F is yes for all but 6'1".

Let each of the 36 sections of this grid represent knowledge of one
geographical name, and the area covered by A, B, and C respectively indicate
the extent to which three pupils know these names. The area and shape of
A, B, and C could have been determined by making a yes or no classification
for each with respect to each of the 36 sections. A is yes to 8-29, or
16 sections, but no to 1-7 and 30-36. B is yes to 21, 22, 26, 27, 28, 33,
34, and 35, or 8 sections, and no to 1-20, 23, 24, 25, 29, 30, 31, 32, and
36. C is yes only to 15, 16, 21, and 22, or 4 sections, and no to all the rest.
So 16 represents A's knowledge of names, 8 represents B's, and 4 represents C's.

Figure 12. Measurement of many-degreed dimensions by a series of dichotomous
classifications.
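
The counting rationale of Figure 12 can be put as a short sketch: each line on the wall, or each section of the grid, yields only a yes or a no, and the measurement is the number of yeses. The inch lines are assumed to run from 5'1" to 6'1" with a 5'0" baseline, which is inferred from the figure text rather than stated in it. Pupil A is left out of the grid example because the figure gives A's sections only as a range.

```python
# Lines on the wall in inches, assumed to run from 5'1" (61) through 6'1" (73).
LINES = list(range(61, 74))
BASELINE = 60  # assumed 5'0" starting point below the first line


def height_by_counting(true_height_in):
    """Classify the boy yes/no against every line, then count the yeses."""
    classifications = [true_height_in >= line for line in LINES]
    return BASELINE + sum(classifications)


def knowledge_score(known_sections, total_sections=36):
    """Classify the pupil yes/no on every grid section, then count the yeses."""
    return sum(1 for s in range(1, total_sections + 1) if s in known_sections)


# Sections covered by pupils B and C, as listed with Figure 12.
pupil_b = {21, 22, 26, 27, 28, 33, 34, 35}
pupil_c = {15, 16, 21, 22}

print(height_by_counting(62), height_by_counting(64))  # boys A and B
print(knowledge_score(pupil_b), knowledge_score(pupil_c))
```

The count is a fair measure only under the assumption stated in the text: every line, and every name, is treated as an element of equal importance that is either possessed or not.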

In our example, then, the geography teacher’s task is to make up items
which, if answered correctly, will classify the pupil as possessing given
elements of a given dimension, and, if answered incorrectly, will classify the pupil
as not possessing those given elements. As a source of the items, the teacher
has available the books, the syllabus, the lecture notes containing the
geographic names, facts, and generalizations that are the elements of the first
four of his dimensions. In addition, he may need to use pupil products and his
recollection of pupil discussions and reports as sources of items having to do
with applications and errors.

The construction of items should be governed by the following general 
considerations and rules. The advantages and disadvantages of items of differ- 
ent types will be discussed later and specifications given for each type. 

FORM 

As a rule, guided response items should be short, affirmative, simple and,
of course, unequivocal. They need to be short so that the significance of items
may be grasped by slow thinkers and poor readers as well as by fast thinkers
and good readers. If the idea to which the item refers is complex or if
intelligence or reading ability is being measured, this rule may need to be excepted.
Negative statements usually hinge upon a single word, not, no, can't, didn't, etc.
If the reader fails to notice the word, he will fail to read the item properly.
In an affirmative construction, on the other hand, meaning is not so dependent
on a single word. If a negative construction has to be used, the negative word
should be underlined.

Items whose syntax is simple are preferable to those whose syntax is
complicated. Unless facility in reading complex sentences is being tested, the
correctness of a pupil’s response should not depend on his ability to solve a
grammatical puzzle. Finally, it is a truism in guided response testing that each item
and each response option must have just one meaning. Ambiguity of items is
a principal cause of unreliability in tests.

LANGUAGE 

Items should be couched in pupil language. A pupil’s knowledge consists
of what he thinks and says and writes. His attitudes consist of feelings toward
things and ideas as he conceives of these things and ideas. When we measure
these by a guided response technique, we assume that a given response to an
item means that the pupil thinks or feels in that given way. Consequently, it is
essential that both items and answer options be phrased in language like the
language pupils use in thinking and talking. Only if we do this can we be sure
that the pupil’s response is on the basis of what he knows or feels. If the
language of the item is too strange, too bookish, or simply too difficult, we do not
know the basis on which any pupil responds; it may be guess, intelligence,
misinterpretation, etc., but we cannot assume that it is on the basis of his
knowledge or attitudes.

¹⁰ The term “elements” is meaningful for the guided response measurement of most
dimensions of educational significance, but is a misnomer in some cases, e.g., reading
comprehension, intensity, etc. Here, a more appropriate phrase might be “a function of
many single responses that occur or do not occur.”

This does not mean that guided response questions should avoid the use
of technical terms, “big words,” and precise grammar. They should use technical
or academic language whenever this is the way pupils must write and talk if
they know the subject. It is the unnecessary use of these that is to be avoided.
In the following example, an item intended for an eighth-grade test but written
sententiously is translated into more appropriate language.

T F The Senate’s failure to approve the League of Nations constituted
a censure of Wilson’s relations with Congressional leaders.

T F The Senate failed to approve the League of Nations because Wilson
had slighted some of its leaders.

LIFTED ITEMS

Do not “lift” items from text or reference books unless the object of the
item is to measure rote memory of what has been read. Except for the axioms
and formulas of mathematics and certain “laws” or generalizations in the
sciences, the language of pupils’ knowledge is not likely to be just like the
language of textbooks. So if pupil language is to prevail, the use of the exact
wording of sentences in textbooks is generally to be avoided. Moreover, even
if the language of the book is simple, “lifted” questions tend to measure rote
memory of book statements. This is particularly true of completion or fill-in
items, in which the context of a statement determines its key words as much as
the statement itself. To illustrate this point, consider the following fill-in item,
which consists of a statement taken verbatim from a World History text with
three words omitted.

None of the three ______, ______, or ______ especially
encouraged the building of civilizations.

According to the context (religion, philosophy, or geography) a correct
response might be Buddhism, Christianity, Taoism. It might be Thoreau,
Adams, Gandhi. Or it might be (what it is intended to be) highlands, forests,
grasslands.¹¹ To be able to answer the question correctly, a pupil would have
to remember that this exact statement occurred in the book.

11 F. C. Lane et al., The World's History, New York, Harcourt, Brace, and Co., 1954,
p. 23.





SCORING 

Prepare for responses to items to be given in the simplest way possible and,
if feasible, so as to permit machine or stencil scoring. In general, for
select-a-response items it is considered better to have options checked or circled than
it is to have them written in. Time is saved and there is less likelihood of
indeterminate responses. Scoring is facilitated by having all responses occur in a
straight column at the left or right margin. If an electrical scoring machine is
available, select-an-answer items may be answered on separate machine-scoring
sheets with resulting economy of time and increased accuracy in scoring.
Standard score sheets may be obtained from the manufacturer of the machine
or from test publishers. Stencil scoring is a mechanical equivalent of electrical
scoring and may be used with a separate score sheet or with the test itself as
long as options are checked, crossed out, or circled. Stencil scoring means that
a cardboard or piece of heavy paper has had a hole punched in the position of
each correct response. The stencil then is laid over a student's test paper and
the marks that show through the holes are counted.

Arrangement-of-elements and provide-an-answer items do not as readily
lend themselves to a uniform response format. However, an effort should be
made to have responses to such items occur toward either margin and in
approximately the same place for successive items.
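
The logic of stencil scoring can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the answer key and the pupil's paper are hypothetical, and the name `stencil_score` is ours. Each keyed position plays the part of a punched hole, and a point is counted wherever the pupil's mark shows through.

```python
def stencil_score(key, responses):
    """Count the items on which the pupil's mark coincides with the keyed option."""
    if len(key) != len(responses):
        raise ValueError("key and responses must cover the same items")
    return sum(1 for keyed, marked in zip(key, responses) if marked == keyed)


# Hypothetical five-item key and one pupil's paper (None marks an omitted item).
key = ["b", "d", "a", "c", "b"]
paper = ["b", "d", "c", "c", None]
score = stencil_score(key, paper)
```

Because the comparison is position by position, the sketch depends on the same condition the text stresses: responses must occur in a fixed, uniform place for every item.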

DISCRIMINATION 

Devise items that will discriminate among all types of pupils to be tested
and include only discriminating items. The function of a guided response item
is to classify pupils relative to some dimension. So it follows that the item must
discriminate among the pupils. If all pupils answer it in the same way, it does
not classify them and the item has measured nothing. Therefore, no items
should be devised that are so easy that all pupils will answer them correctly,
or so difficult that all will answer them incorrectly, when the purpose of the
item is to measure. If an item is meant to give confidence to certain pupils or
to deflate the egos of others, it may be used on motivational grounds but not
on the grounds of measurement. Responses to items of this sort should be
excluded from total scores.

An apparent exception may be made to this rule when there are specific
instructional objectives that all pupils must achieve and to the same degree.
For example, in beginning algebra all pupils must learn to transpose terms
(x = y may be stated x - y = 0). A test in algebra might have an item keyed
to such an element in full expectation that all pupils in a given class would 
answer it correctly. In a larger population of pupils, however, there would be 
some (those who had not studied algebra) who would answer incorrectly. 
Hence, the item would discriminate with reference to this larger population. 

Not only must test items discriminate, they must discriminate positively. 
For achievement and intelligence tests positive discrimination means that a 





greater proportion of pupils who get high scores on the test answer the item 
correctly than pupils who receive low scores on the test. A total score for a 
test must bear a direct and not an inverse relationship to any item score. Ob- 
viously, this is essential if the size of a total score is to be directly related to 
the degree to which a pupil knows or feels a given thing. The degree of positive 
discrimination for items often is used as a statistical index of their validity. 

A simple way to check items both for general discrimination and positive
discrimination is to pick papers whose total scores represent the top 25 per
cent and those whose scores constitute the bottom 25 per cent.¹² Then for each
item tabulate the percentage of each of these groups who answered the item
correctly. Figure 13 shows a section of a table resulting from this item analysis
process.


Item    Per cent of upper quarter    Per cent of lower quarter
        who answered correctly       who answered correctly

23               75                           35
24               50                           45
25              100                          100
26          [illegible]                       40
27               40                           10

Figure 13. Section of an item analysis table.

With reference to Figure 13, all items discriminate except Number 25. It
would be necessary to check all the papers to determine whether 25 has any
discriminating power at all. Items 23 and 27 show marked positive
discrimination. Item 24 shows a slight degree of positive discrimination, but Item 26
shows negative discrimination. Consequently, items 25 and 26 are suspect
until proved valid or are revised; items 23 and 27 are clearly good items; while
Item 24 should be inspected for flaws but may be valid.
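
The upper-quarter and lower-quarter check just described can be sketched in Python. This is a minimal illustration, not a prescribed routine: the function names are ours, the eight-pupil class is hypothetical, and the percentages are those read from rows 23, 24, 25, and 27 of Figure 13.

```python
def quarter_groups(total_scores):
    """Return (upper, lower): indices of the top and bottom 25 per cent of papers."""
    ranked = sorted(range(len(total_scores)),
                    key=lambda i: total_scores[i], reverse=True)
    size = max(1, len(ranked) // 4)
    return ranked[:size], ranked[-size:]


def per_cent_correct(item_marks, group):
    """Percentage of the pupils in `group` who answered the item correctly."""
    return 100 * sum(item_marks[i] for i in group) / len(group)


def discrimination(upper_pct, lower_pct):
    """Label an item from its upper- and lower-quarter percentages."""
    if upper_pct > lower_pct:
        return "positive"
    if upper_pct < lower_pct:
        return "negative"
    return "none"


# Percentages read from rows 23, 24, 25, and 27 of Figure 13.
figure_13 = {23: (75, 35), 24: (50, 45), 25: (100, 100), 27: (40, 10)}
labels = {item: discrimination(u, l) for item, (u, l) in figure_13.items()}

# Hypothetical class of eight pupils: total scores, and marks on one item (1 = correct).
totals = [10, 3, 8, 1, 7, 5, 9, 2]
marks = [1, 0, 1, 0, 1, 0, 1, 1]
upper, lower = quarter_groups(totals)
upper_pct = per_cent_correct(marks, upper)
lower_pct = per_cent_correct(marks, lower)
```

The label says only which way an item discriminates. How large a difference counts as "marked" or "slight" remains a matter of judgment, as in the discussion of Item 24.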

Item analysis is an empirical process requiring at least one administration 
of the test items. It is useful generally for tests or test items that are to be 
reused. The analysis gains significance if several administrations of the items
contribute to the figures. One convenient mechanical arrangement for item
analysis is to write each item on a separate card and to enter there a tally of
pupil responses for each administration of the item (see Figure 14). Then, in
immediate view of the results over a period of time, the item may be inspected
for discrimination and difficulty. 

While item analysis is time-consuming, it is the only way known to verify 
the validity of items. Its continuous use with a large inventory of items placed 

12 Additional statistical devices for item validation are described in such texts as
Bean (2) and Tate (34).





The largest mountain range in the United States is (a) Sierra Nevadas, (b)
Rocky Mountains, (c) Appalachian Mountains, (d) Cascade Mountains.

Date     Number of    Number answering    Per cent correct    Per cent correct
         pupils       correctly           in 4th quarter      in 1st quarter

12-56    42           30                  74                  58
12-55    45           35                  80                  63

Figure 14. Illustration of a test item entered on a card together with an analysis
of responses to the item.

on separate cards facilitates the construction of any particular test on the same
subject. The necessary type and number of items may simply be selected from
the card file, arranged in proper sequence, and typed for duplication. New
items must, of course, be designed as those that prove to be invalid are
discarded and as new dimensions or subject elements are to be tested.

As a rule, it may be assumed that items you think will discriminate posi- 
tively actually will do so. Technical causes of no discrimination or negative 
discrimination in items have been summarized by Lindquist (22) much as 
follows: (a) weaknesses in the item: ambiguity, clues, misleading elements, 
etc.; (b) insufficient learning of the conception to which the item is keyed, so
that a plausible false option appeals to superior pupils; (c) a decoy response
that appeals to conflicting learning from other subjects, to conflicting preju- 
dice, or to peer opinion, with the right response appealing only to experts. 

DIFFICULTY 

If technical flaws in items or improper test conditions do not produce the 
varying degrees of discrimination observed for achievement and intelligence 
test items, it is assumed that the relative difficulty of the items produces the 
effect. Thus, degree of discrimination may be taken as an index of item
difficulty. An item which 90 per cent of a group answered correctly would be
considered an easy item. One which only 10 per cent answered correctly would be
termed very difficult. An item that half of a class answered correctly and half
answered incorrectly is said to have 50 per cent difficulty.

The question of just how difficult should be the items in a test has been 
studied for many years but it remains a moot point (9). Perhaps, what is an 
optimum test may not be stated without reference to a particular measurement 
purpose, and, therefore, neither may optimum item difficulty be discussed in
the abstract. Obviously, tests used to select a small number from among a great 
number (a foreign service examination, for example) may include items of 
greater mean difficulty than tests designed simply to measure variation among 
an unselected population (for instance, an adult group intelligence test).







A test for speed of performance should, of course, have items of the same 
difficulty, whether this is high or low. In power tests, those designed to meas- 
ure the level of a pupil’s ability or knowledge, it is usual to have items that 
increase in difficulty as the test goes on. The last question a pupil can answer 
correctly is taken to be indicative of his level of achievement. Hence, items 
should be carefully graded as to difficulty and placed in ascending order of 
difficulty. Power tests are used widely in the measurement of intelligence,
mathematics, vocabulary, and reading. 

In tests directed at achievement in general, particularly toward what a 
pupil knows, how much he knows, and how accurate is his knowledge, we 
advise that items be constructed without special regard to their difficulty. 
Careful attention should be given to adequate sampling of the dimensions of 
the subject. If the sampling is adequate, the items will vary in difficulty ac- 
cording to the gradations of difficulty inherent in aspects of the subject. If 
there are no gradations of difficulty among elements of the subject, then item
difficulty is not a relevant consideration. 

Item difficulty not only is a function of the element to which it is keyed
but also of the type of item. This type difference in difficulty is not usually in- 
cluded as a factor in analyses of item difficulty based on their discriminating 
power, and only items of the same type should be so compared. The inherent 
difficulty of a given type of item may be a factor in selecting that type of item 
for use; it may affect the weighting of items; but it should not be confused 
with the difficulty of the knowledge element to which the item is keyed. 

The foregoing discussion of item discrimination and difficulty has related 
to tests of ability or subject achievement. The basis for discrimination in tests 
of personality, attitudes, and interests will differ, but the need for positive dis- 
crimination in test items is as applicable to the measurement of personality 
variables as it is to achievement. Item difficulty has, of course, no relevance 
for testing in these areas. 

INDEPENDENCE 

Items should be mutually independent. The usual interpretation of scores 
on guided response tests assumes that a response to one item is not affected 
by a response to any other item. Without independence of items, the probabil- 
ity factor in guided response items becomes complex and tends to distort the 
relationship between scores and achievement. A pupil who knows the item 
that is a clue to another has an advantage over the pupil who does not.

For example, suppose that in a six-item test the probability of a correct 
response to items 2 and 3 is doubled if there has been a correct response to 
Item 1 and it is reduced to zero if there has been an incorrect response to Item 
1. On the other hand, the probability of a correct response to items 4, 5, and 6
is not affected by a correct response to Item 1. Two pupils, each with an ability
to get 50 per cent of the items correct, might make different scores just because
of this condition, as follows:










         Independent probability    Probability of Pupil A    Probability of Pupil B
         of a correct answer to     having correct answers    having correct answers
         any item for either        if he answers 1           if he answers 1
Item     pupil                      incorrectly               correctly

1        1/2                        1/2                       1/2
2        1/2                        0                         1
3        1/2                        0                         1
4        1/2                        1/2                       1/2
5        1/2                        1/2                       1/2
6        1/2                        1/2                       1/2

Most probable
total score     3                   2                         4
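
The totals in the last row of the table can be checked by adding each column of probabilities. The sketch below uses Python's `fractions` module so that the halves add exactly. The variable names are ours, and the first-column entry for item 6 is assumed to be 1/2, as the column total of 3 implies.

```python
from fractions import Fraction

half = Fraction(1, 2)

# Column 1: every item independent, probability 1/2 each (item 6 assumed 1/2).
independent = [half, half, half, half, half, half]
# Column 2: Pupil A misses item 1, so items 2 and 3 fall to zero.
pupil_a = [half, 0, 0, half, half, half]
# Column 3: Pupil B answers item 1 correctly, so items 2 and 3 double to certainty.
pupil_b = [half, 1, 1, half, half, half]


def expected_score(probabilities):
    """Sum of the per-item probabilities of a correct answer."""
    return sum(probabilities, Fraction(0))


totals = [expected_score(column) for column in (independent, pupil_a, pupil_b)]
```

Two pupils of the same ability thus differ by two points on a six-item test solely because items 2 and 3 depend on item 1.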


Mutual independence does not mean that several items may not be keyed
to the same element. It means simply that the answer to one item should not
be stated or suggested in the body of another or that the significance of one
item should not be contingent upon a proper answer to another. Items 17 and
35-39 below illustrate the “clue type” of interdependence, and items 43 and
44 the “contingency type.”


NOTE. These items are bad examples.

T F 17. The Columbia River empties into the ocean in the state of
Washington.

Name five navigable United States rivers that have industrial
and hydroelectric importance.

35. ______
36. ______
37. ______
38. ______
39. ______

43. What kind of clouds cause thunderstorms? ______

44. The usual mean depth of these clouds is
    a. 3,000 feet
    b. 6,000 feet
    c. 500 feet
    d. 1,500 feet
    e. None of these


SAMPLING 

As a group, items should sample adequately all dimensions subject to 
sampling of the phenomenon being measured. Unfortunately, there are no





precise directions that may be set forth to insure proper sampling by items. 
We have discussed earlier the general nature of sampling and some principles 
that pertain to it (pages 38-40). These principles are as applicable to this 
sampling task as to any other. 

The number of parts and/or subdivisions involved in a dimension some- 
what govern the number of items that should be keyed to it. In illustration of 
this, consider the first dimension the teacher in our example designated for 
knowledge of geography, “What and how many geographic terms the pupil 
can identify in each basic area of geography.” If there were three basic areas, 
earth, sea, and climate, and each involved 100 terms, 25 questions might be 
keyed to each area. On the other hand, if there were 1,000 terms for each of
the areas, sampling would seem to demand more items for an equally good
sample. If the areas were unequal as to the number of terms they subsumed,
the number of items keyed to each area should be somewhat proportional to 
the different number of terms in the areas. However, there is no exact rela- 
tionship between the number of elements in a dimension and the adequacy of 
given size samples. Just because 1,000 geographic terms might relate to the 
earth and 500 to the climate, a sample of 50 earth-keyed items would not 
necessarily be equivalent in adequacy to 25 climate-keyed ones. 
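
The "somewhat proportional" rule can be illustrated with a short Python sketch. It is offered only as a starting point for judgment, since the text itself warns that proportional samples are not necessarily equally adequate. The function name and the largest-remainder rounding are our assumptions. The term counts come from the example above, and the total of 75 items is simply the sum of the allocations the example mentions.

```python
def allocate_items(term_counts, total_items):
    """Divide `total_items` among areas in proportion to their number of terms.

    Whole items are assigned first; any left over go to the areas with the
    largest remainders (an assumed rounding rule).
    """
    total_terms = sum(term_counts.values())
    shares = {area: divmod(total_items * n, total_terms)
              for area, n in term_counts.items()}
    allocation = {area: whole for area, (whole, _) in shares.items()}
    leftover = total_items - sum(allocation.values())
    by_remainder = sorted(shares, key=lambda area: shares[area][1], reverse=True)
    for area in by_remainder[:leftover]:
        allocation[area] += 1
    return allocation


equal_areas = allocate_items({"earth": 100, "sea": 100, "climate": 100}, 75)
unequal_areas = allocate_items({"earth": 1000, "climate": 500}, 75)
```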

Aside from the “how many items” aspect of sampling, it is necessary to
consider what items are needed to give an adequate sample. The rule, of
course, is that the sample exactly represents the whole, just as a reduced
photograph exactly represents a larger one. This condition may be approached, first,
by including items for each subdivision or aspect of a dimension. If the num- 
ber of components of the subdivisions is known, the number of items for each 
should, as we have stated, be proportional. 

Second, it is necessary to select the actual elements to which items are to
be keyed so that each element in the subdivision has equal probability of
being in the sample. For example, if recognition of names of navigable rivers
is to be measured, each navigable river should have as good a chance as any
other navigable river of being in the test. To accomplish this, it is necessary
to use some system of random selection. Pages may be chosen in an automatic
fashion, every fifth, tenth, etc., from text and reference books, or facts and
ideas may be written on slips and a sample literally drawn from a hat.
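
Both selection systems mentioned here have direct counterparts in Python's standard `random` module. The sketch below is illustrative only: the function names are ours, and the list of rivers is a hypothetical subdivision, not a complete one.

```python
import random


def draw_from_hat(elements, how_many, seed=None):
    """Draw `how_many` distinct elements, each with an equal chance of selection."""
    rng = random.Random(seed)
    return rng.sample(list(elements), how_many)


def every_nth_page(first_page, last_page, step, start=None, seed=None):
    """Choose every `step`-th page, beginning at `start` (random if not given)."""
    if start is None:
        start = first_page + random.Random(seed).randrange(step)
    return list(range(start, last_page + 1, step))


# Hypothetical subdivision of elements to be sampled.
rivers = ["Columbia", "Hudson", "Mississippi", "Missouri", "Ohio", "Tennessee"]
sample = draw_from_hat(rivers, 3, seed=1)
pages = every_nth_page(1, 50, 10, start=5)
```

Choosing the starting page at random keeps the "automatic" page system from always favoring the same pages of a book.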

An objection frequently raised to a random selection of elements is that 
some elements are so important that they must be included. If this is the case,
then these very important items should constitute a separate subdivision to be
included in toto or to be sampled separately. The device of random selection is 
intended only for and is only valid for groups of homogeneous elements of 
equivalent importance. 

An empirical procedure is available to determine when items are sufficient 
in number and nature to sample a dimension properly. Use a given number of 
items keyed to a given dimension in a test, administer the test, and note the 
results. Add items keyed to the same dimension to the test, readminister it to
the same group, and compare the results with the first testing. Continue this 





add-items-and-retest process until the continued addition of items produces no 
change in the relative standing of pupils on the test. The number of items you 
have at this time is all you need. The procedure, regrettably, is time-consuming 
and difficult and usually is employed only in the construction of standardized
tests. 
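
The stopping rule of this add-items-and-retest procedure (no change in the relative standing of pupils) can be sketched as a comparison of rank orders. The pupils and scores below are hypothetical, and requiring an identical order is our strict reading of "no change"; a looser criterion, such as a high rank correlation, could be substituted.

```python
def standing(scores):
    """Pupils ordered from highest to lowest score; ties are broken by name."""
    return sorted(scores, key=lambda pupil: (-scores[pupil], pupil))


def standing_unchanged(earlier, later):
    """True when adding items left the relative standing of pupils as it was."""
    return standing(earlier) == standing(later)


# Hypothetical scores after each successive enlargement of the test.
first = {"Ann": 12, "Ben": 9, "Cal": 15}
second = {"Ann": 19, "Ben": 20, "Cal": 24}
third = {"Ann": 23, "Ben": 25, "Cal": 30}

keep_adding = not standing_unchanged(first, second)
enough_items = standing_unchanged(second, third)
```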

Types of Items 

Select types of items for use according to their special properties and
construct them according to their type specifications. The characteristics of the
three basic categories of guided response items (select-an-answer, provide-an-
answer, arrange elements) as to their chance factor, administration time, and
difficulty are summarized for comparative purposes in Table 2.

Chance Factor and Administration Time. From this overview it is apparent
that select-an-answer items are the most efficient for time of administration,
but are likely to be the least reliable with their high chance factor.
The chance factor means simply that by guessing alone a pupil would be
expected to make an indicated percentage of correct responses. With chance a
factor in determining pupils’ scores, the reliability of those scores is reduced
since a proportion of any score must fluctuate in a random fashion. This
situation is analogous to the “scatter” found for a group of shots from a rifle.
Even though the rifle is clamped in a vise and has been “zeroed in” on a
target, not all the shots hit the same spot, but will describe a cluster thus:



with the position of any given shot partially determined by chance variations 
in ammunition, wind, barrel whip, etc. The larger these variations the larger 
the cluster or the greater the unreliability of the shooting. Just so, larger 
chance factors make for greater inherent unreliability in test items. 

Provide-an-answer items, on the other hand, have an advantage so far as 
inherent reliability is concerned. Chance may be considered a negligible factor. 
But they are at a disadvantage relative to time of administration. No data are
available as to the administration time of arrange-elements items, but their
inherent reliability in most cases should be intermediate between that of
provide-an-answer and select-an-answer types.

Difficulty of the Three Types of Items. The relative difficulty of the three 
types of items, if subject difficulty is the same, is a function of the psychologi- 
cal activity involved in responding to them. To answer true-false, multiple- 
choice, and matching questions correctly requires merely that a right response 



[Table 2. Comparison of the three categories of guided response items. Column
headings: type of item; chance factor; items administerable in ten minutes
(elementary, secondary, college; low 10 per cent, average, upper 10 per cent);
relative difficulty. The select-an-answer row is entered as least difficult.
The remaining entries of the table are not recoverable.]


be recognized. This amounts to recalling or “sensing” that a present
perception has been perceived previously. As we know, this represents a
simple and easily attained learning of which even lower animals are capable.

Provide-an-answer items are likely to be more difficult since verbal recall 
is a more exacting psychological activity than recognition. In general terms, 
“recall” means that part or all of a previous experience may be re-enacted in 
response to some fraction of the stimulational field that produced the original 
action. Higher animals are distinguished by their facility for such re-enact- 
ments and it appears later in the mental development of the human infant 
than does facility for recognition.¹³

It is likely that arrange-elements items have about as much inherent
difficulty as provide-an-answer ones.¹⁴ Although recognition of elements is
involved rather than recall, recall of a pattern or reasoning out of a pattern is
necessary. Moreover, since arrangement items can be keyed to very complex 
patterns of relationship, they may have a greater difficulty potentiality than 
any of the other types of items. 

In addition to the relative difficulty, reliability, and administration time of 
types of items, it is necessary to consider some of their unique characteristics, 
their usual applications, and specifications for constructing them. 

SELECT-AN-ANSWER ITEMS AS A GROUP

True-false, multiple-choice, and matching items are familiar to pupils,
their chance factor is exact, and they are the easiest to score of any items.
Only select-an-answer items are currently amenable to machine scoring. They
are best adapted to measuring areas of knowledge or ability that involve many
specific items of exact information or performance. Agree-disagree or
yes-no-maybe variants of true-false items and the others are, in addition,
useful for gross screening in the measurement of personality attributes.
Select-an-answer items have been found of less use in such subjects as art, music,
physical education, shop, etc., and in appraising those phenomena we call
skills (writing, drawing, studying, etc.). In subjects in which relationships and
patterns are significant, and facts per se are less so (philosophy, sociology, and
psychology, for example), select-an-answer items have limited usefulness.

A frequently voiced criticism of select-an-answer items and one with which 
the authors must agree is that they are artificial in character. It is axiomatic 
in measuring behavioral phenomena that the test situation is best which ap- 
proximates most closely the actual situation in which the phenomenon being 
measured usually occurs. Now, thinking true or false to specific questions, 
mentally reviewing optional answers to a specific question and then selecting 
one, or matching up columns of associated ideas does not describe the

13 The actual distinction between recognition and recall is considered to be one of
degree rather than kind by most psychologists.

14 The writers are aware of no research bearing on their difficulty, and the
psychological activity in responding to them is not as clear-cut as it is for the other types.





anatomy of knowledge, personality, intelligence, or anything but behavior 
during a test. 

TRUE-FALSE ITEMS 

While they are the most economical of time and space, true-false items
are particularly susceptible to stereotyped construction and, thus, to the
operation of syntactical clues. One early study of the syntax of true-false items
(5) found that longer items were more likely to be true; four of five statements
containing all were false; three of four containing always and never
were false; four of five containing no, none, or nothing were false; and nine
of ten using the word only or alone were false. For some reason, stereotyped
expressions tend to accompany false items and other stereotyped expressions
tend to accompany true items. Pupils become habituated to these stereotypes
and try to answer many questions on the basis of their form. A second general
liability of the type is that true statements must be totally true while false ones
may be only partially false. For this reason, true statements that are not at the
same time give-aways are sometimes very difficult to construct.

Whatever their liabilities, though, well-designed true-false items have great
usefulness in educational measurement. Among the specifications for their
construction are the following:

1. Avoid the expressions that tend to go with false items: all, always,
never, no, none, nothing, only, and alone.

2. Make true and false items of about the same mean length.

3. Avoid statements so abstract or so general as to be unanswerable by a
true or false, or yes or no.

4. Have truth or falseness be a function of the basic predication of a
statement, not of some minor element, clause, or phrase. For example:

(Right)

T F Calvin Coolidge was famous for his reluctance to speak at length.

(Wrong)

T F Calvin Coolidge, elected president in 1922, was famous for his
reluctance to speak at length.

In the first item, truth or falseness properly hinges on Coolidge's taciturnity,
and the item is true. In the second, while the basic predication of the statement
is still true, the phrase “elected president in 1922” is false, since there was no
presidential election in 1922. Hence, the statement as a whole is false.

5. Avoid such vague qualifiers as usually, seldom, much, little, many,
few, large, small, etc.

6. Make approximately half the items true and half false and determine
the order of true, false by chance.

7. Have enough true-false items so that the chance factor (50 per cent)
does not too seriously restrict the range of the test or the reliability of a score.




Ten items are certainly too few, for five could be answered correctly by
guess alone and, thus, the effectual range of scores is only five. In such a
restricted range chance would be a disproportionately large factor in any pupil
score. We suggest that fifty true-false items be considered a rule-of-thumb
minimum.¹⁵
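
The arithmetic behind this rule of thumb is simple enough to sketch in Python. The function names are ours, and the sketch assumes what the text assumes: that a pupil guessing blindly succeeds on one item in every `n_options`. The same functions apply to multiple-choice items by changing the number of options.

```python
def chance_score(n_items, n_options):
    """Number of items a pupil would be expected to get right by guessing alone."""
    return n_items / n_options


def effectual_range(n_items, n_options):
    """Score range that remains after the expected chance score is set aside."""
    return n_items - chance_score(n_items, n_options)


ten_true_false = effectual_range(10, 2)      # five of ten fall to chance
fifty_true_false = effectual_range(50, 2)    # the suggested minimum
four_option = effectual_range(33, 4)         # multiple-choice comparison
```

On this reckoning fifty true-false items leave an effectual range of 25, which is about what 33 four-option multiple-choice items provide.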

MULTIPLE-CHOICE ITEMS 

Those items offering 3, 4, 5, or even 6 possible answers from which the 
testee is asked to choose (see Figure 9) are more difficult to construct and 
less economical of space than true-false items. The difficulty in their
construction lies chiefly in designing plausible decoys or false options. On the
advantage side, their chance factor is less than that of true-false items, and they
can be keyed to more complex elements of knowledge or feeling than can 
true-false items. It sometimes is advantageous to combine multiple-choice 
and true-false types. For example:


(Mark each option true or false) 

The Mississippi River affects the economy of the United States as follows: 
T F a. It is used for passenger traffic. 

T F b. Coal barges sail on it. 

T F c. It serves to irrigate Iowa and Illinois. 

T F d. It produces more hydroelectric power than any other American 
river. 

T F e. Its immediate valley is rich farming country.

T F f. It is navigable as far north as Rock Island. 


The following specifications are offered for the design of multiple-choice
items.

1. Four or five options are standard practice and the use of at least four
is advised. With three options, the chance factor is nearly as great as in true- 
false items. 

2. The number of multiple-choice items needed for reasonable reliability 
and range of scores is, of course, less than for true-false items. If we use an 
effectual score range of 25 as a rule-of-thumb minimum, we should have at 
least 33 four-option items, 31 five-option, or 30 six-option. Chance would, 
on the average, account for about 8 points, 6 points, and 5 points respectively 
of any pupil’s total score. 

3. No options may be absurd or obviously true. Absurd options reduce 

15 If a quiz (as on a reading passage) is used for motivational purposes only or
just to see who have read the passage and who have not, number of items is not such
an important consideration.

16 Assuming that pupils will guess at some answers. See pages 105, 118.

17 This assumes that pupils will guess at some answers. See pages 105, 118.




the number of options that test anything by the number that are absurd.
Obviously true ones permit all pupils to select the same one and thus the item
will not discriminate.

4. If a negative statement is used, underline the negative word so that it 
will not be overlooked. 

5. It is advisable to have options be the responses to a question, the
solutions to a problem, or the predicate of a sentence whose subject is the basis
of the item. Such simple constructions are likely to offer fewer contextual
clues to the correct answer, and the effect of reading skill and intelligence
hence is minimized.

6. All options should be grammatically consistent so that the syntax of 
the item will neither help nor hinder a correct response. 

7. Obviously, the more nearly true or plausible decoy options are, the
more difficult it is for the pupil to detect the correct response. Hence, the near- 
correctness or plausibility of options may be increased to measure higher 
levels of knowledge or intelligence. 

8. Correct options must not always be in the same place, nor should they 
fall into a pattern. Chance selection of the correct position will insure both 
conditions. A table for use in randomizing multiple-choice options has been 
prepared by Anderson (1). 
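
Where a printed randomizing table is not at hand, the chance placement called for in specification 8 can be approximated with a pseudo-random shuffle. The Python sketch below is a substitute for such a table, not a reproduction of it. The function name is ours, and the options are borrowed from the mountain-range item used elsewhere in this chapter.

```python
import random

LETTERS = "abcdef"


def place_options(correct, decoys, rng):
    """Shuffle the correct option in among the decoys; return (options, key letter)."""
    options = [correct] + list(decoys)
    rng.shuffle(options)
    return options, LETTERS[options.index(correct)]


rng = random.Random(1957)  # seeded only so that the example is repeatable
decoys = ["Sierra Nevadas", "Appalachian Mountains", "Cascade Mountains"]
options, key_letter = place_options("Rocky Mountains", decoys, rng)
```

Recording the key letter at the moment of shuffling keeps the scoring key in step with the printed order of options.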

MATCHING ITEMS 

Matching items are the easiest of all to construct and are economical of 
space. However, their use largely is limited to associative pairs: presidents
and dates, wars and dates, authors and books, stories and characters, inventors
and inventions, etc. Adherence to the following rules will insure maximum
reliability in their construction. (For example, see Figure 9.)

1. The best number of items in the stimulus column (things with which
others are to be matched) has not been precisely determined by research, but
probably is between 5 and 12. 

2. There should be more items in the response column than in the stim- 
ulus column (say three more) so that the last of a series may not be matched 
just by elimination. 

3. Stimulus and response columns should be on the same page, side by 
side, with the stimulus column to the left. 

4. Directions should clearly state which column is to be matched with 
which, and what is to be written in, letters, numbers, or words. 

5. A single matching series should involve a single subject (wars-dates,
books-authors, etc.) and not several subjects. If heterogeneous subjects are
included, correct answers may be gained by reasoning as well as by knowledge.

6. Items in one of the two columns may be listed in some logical order, 
but items in the other must have a random sequence so that item position 
can be no clue to that which it matches. Random sequence (so far as the
significance of items is concerned) frequently may be obtained just by putting 
the items in alphabetical order. 
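Several of the rules above lend themselves to a mechanical check. The sketch below, in Python, assembles a matching exercise under the assumptions that the stimulus column keeps the order in which it is supplied, that the response column is put in alphabetical order, and that all responses are distinct. The function name `build_matching_item` and the sample authors-and-books series are illustrative, not taken from Figure 9.

```python
from string import ascii_uppercase

def build_matching_item(pairs, decoys):
    """pairs: (stimulus, correct response) tuples; decoys: extra responses."""
    if not 5 <= len(pairs) <= 12:
        raise ValueError("rule 1 suggests between 5 and 12 stimulus items")
    if not decoys:
        raise ValueError("rule 2 calls for more responses than stimuli")
    # Rule 6: alphabetical order serves as a random order of significance.
    responses = sorted([r for _, r in pairs] + list(decoys))
    if len(responses) > len(ascii_uppercase):
        raise ValueError("too many responses to letter A-Z")
    letter_of = {r: ascii_uppercase[i] for i, r in enumerate(responses)}
    key = [(stimulus, letter_of[r]) for stimulus, r in pairs]
    return responses, key

# Rule 5: a single subject (books-authors) throughout the series.
pairs = [("Moby-Dick", "Melville"), ("Walden", "Thoreau"),
         ("The Scarlet Letter", "Hawthorne"), ("Leaves of Grass", "Whitman"),
         ("Little Women", "Alcott")]
decoys = ["Cooper", "Irving", "Poe"]          # three extras, as rule 2 suggests

responses, key = build_matching_item(pairs, decoys)
for letter, name in zip(ascii_uppercase, responses):
    print(letter, name)
print(key)
```

The returned `key` doubles as the scoring key; the three decoys keep the last pairing from being made by elimination alone.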



110 


fundamental conceptions and procedures 


PROVIDE-AN-ANSWER ITEMS

Completion, short-answer, and labeling items are illustrated in Figure 10.
As a group, they are a more natural type of item than select-an-answer. Their
form more nearly approximates the way subjects are presented, and responses
to them are more like the actions and thoughts that we say constitute subject
achievement. Like the other category, though, they usually are applicable
only to subjects in which items of exact information and/or precise statements
are important. Labeling items are particularly useful in scientific and
mechanical courses.

The construction of provide-an-answer items is relatively easy, but scoring
them is more tedious than select-an-answer. They may not be machine scored.
Frequently, more than one exact answer may be correct and some interpretation
on the part of the scorer is necessary.

COMPLETION OR FILL-IN ITEMS

Completion items are easy to prepare and can be used to measure retention
of a composite idea. For example, pupils’ knowledge of an entire electrical
concept may be tested by omitting the underlined words in the following
sentence:

In electricity, the resistance of a wire determines the amount of current that
will flow in response to a given amount of voltage.

In constructing completion items, the following procedures are advisable:

1. Omit important words or short phrases only.

2. Ascertain that only one word or short phrase is the correct response.
If this is impossible, insure that only a limited number of insertions will be
correct and that you know each of them.

3. Do not leave so many blanks in a given statement that the statement
loses its meaning.

An absolute distinction is impossible between provide-an-answer items, which
we classify as guided response, and those which we classify as free response. As
short-answer responses become longer, and as free responses become shorter, the two
must converge, of course. As a basis for distinguishing between the two types, we
offer the following:

a. If necessary, select-an-answer items could be substituted for provide-an-answer
ones. They could not be substituted for free-response items.

b. The answers to a provide-an-answer question will be either right or wrong, will get
full value or no value. The answers to a free-response question will, on the other hand,
range from those entirely wrong through those of little, of medium, and of great value
to those entirely correct.

Distinguishing between the two types is not entirely an academic point. Free-response
items require a different sort of scoring key from guided-response items, weighting has
a different significance for them, and they can easily appraise aspects of knowledge that
guided-response items can tap only with great difficulty.





4. Make all spaces in the same statement the same length, so that the
size of space can be no clue to the length of the right word or phrase. Any
space should be large enough for the longest word or phrase.

5. Within a single statement, omitted words or phrases should be of
parallel grammatical significance to prevent the operation of syntactical clues.

SHORT-ANSWER ITEMS

Since responses may be arranged to occur in a vertical column, short-answer
items may be easier to score than the completion or fill-in type. On
the other hand, they do not lend themselves so well to testing complex ideas,
since only one response usually is made to each question or problem. Other
than these differences, they are similar to completion items and their construction
should be guided by two of the same maxims: only one word or phrase
is the correct answer, and space for responding should be no clue to the length
of response. In addition:

1. Items should be direct questions or commands:

What is the formula for sulphuric acid?

or

Give the formula for sulphuric acid.

2. Questions or commands should be directed to the important aspect(s)
of a fact or idea rather than to superficial or trivial ones, as:

Clemens’ pseudonym, Mark Twain, derived from what Mississippi
steamboat task?

Not

How many feet of water does Mark Twain mean?

LABELING ITEMS

These items, which are a mainstay of industrial arts and science teachers,
are as nearly foolproof as any. Moreover, they are applicable not only to
knowledge of tangible objects, but also to abstractions that may be presented
schematically. Knowledge of governmental organization, plant and animal
classification, stage directions and movements may be tested by labeling items
as well as knowledge of life cells (see Figure 10), the parts of a rifle, rivers
and mountain ranges, etc.

Among the special directions to follow in designing labeling items are
these:

1. The diagram to be labeled should be clearly drawn and entirely
recognizable.

2. The parts to be labeled should be indicated definitely, and as well the
place where the label is to be written.

3. Scoring often is facilitated if labels are written in a vertical column
rather than on the face of the diagram. This requires the use of arrows and
should not be done if the diagram will become cluttered and confusing.




4. Labeling items can be converted into select-an-answer items if desir- 
able, by the inclusion of a list of labels (together with some extras as decoys) 
to be properly assigned. The item is then a special type of matching item.

ARRANGEMENT-OF-ELEMENTS ITEMS

Arrangement items (see Figure 11) have had limited use because they
measure memory of relationships and organization or the ability to reason
out such relationships. Too, assembly items are most appropriate to mechanical
subjects in which observation and product analysis often are more efficient
means of measurement. Both ordering and assembly items are wasteful
of time and space, and scoring them may be difficult if partial credit is to be
given for partially correct arrangements.

However, the items seem to have an unusually high potential for measuring
concepts of organization in many school subjects and for measuring reasoning
and spatial relations aspects of intelligence. As such, they deserve full
consideration by any test designer.

In addition, arrangement type items are useful in the measurement of
personality variables. The MAPS test (30), one of many that employ arrangement
items, consists of many pictures from which the subject is asked
to select a few and to arrange them in a picture story. The pictures he selects
and the way he arranges them are taken to be indicative of his personality
structure and tensions.

The following principles bear on the construction of arrangement-of-elements
items.

1. The type of arrangement desired should be stated clearly and be illustrated
if necessary. Illustrations usually are necessary for elementary pupils.

2. The place where the desired array is to be put or presented should be
indicated definitely.

3. Research furnishes no precise information as to how many elements
should go into an arrangement item for given purposes. However, we know
generally that the number of elements appropriate for the same relative degree
of difficulty should increase from grade to grade. Moreover, enough elements
should be used to minimize chance solutions. If three are given, chance would
produce a solution one out of six times on the average. If four are given, the
chance factor decreases to one in twenty-four.

4. As a rule, an arrangement item deserves special weighting if it is to be 
included in a total score along with items requiring less thought and time. 
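The chance figures in principle 3 follow from counting orderings: n distinct elements can be arranged in n! ways, of which only one is correct. A short Python sketch (the function name `chance_of_correct_order` is ours, and it assumes a single correct order with every order equally likely under blind arrangement) extends the figures to longer items.

```python
from fractions import Fraction
from math import factorial

def chance_of_correct_order(n_elements):
    """Probability that a blind arrangement of n distinct elements is correct."""
    return Fraction(1, factorial(n_elements))

for n in range(3, 7):
    print(n, "elements:", chance_of_correct_order(n))
# 3 elements: 1/6
# 4 elements: 1/24
# 5 elements: 1/120
# 6 elements: 1/720
```

Adding even one element cuts the chance factor sharply, which is why a modest number of elements suffices to minimize chance solutions.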

Composition of the Test 

Assemble items and affix directions according to the intended use of the
instrument and certain tenets of test format. While valid items may have been
devised that sample adequately all dimensions of the phenomenon being
measured, it still remains to assemble these into a “test.” The characteristics
of the test as a whole are of equal importance with the characteristics of the
items in determining the efficiency of the instrument. Improper length, se- 




quence of items, directions, scoring key, etc., can invalidate the test results. 
On the other hand, proper length, sequence, directions, and scoring key will 
contribute validity and reliability over and above that inherent in the items.

LENGTH 

Two primary considerations are involved in the length of a test: optimum
administration time and reliability. From Table 2, it is apparent that certain 
types of item may be administered more rapidly than others. Hence, for a 
given administration time, a true-false test may be relatively long and a short- 
answer test relatively short. 

Administrative Considerations. Secondary school periods vary in length
but fifty minutes probably is close to the median. It follows that a test designed
for a secondary grade ordinarily should be administrable in such a
space of time. If we allow five minutes to begin the class, and five to close it,
maximum test time for a fifty-minute period is forty minutes. If a longer test
is necessary than may be taken in one period, the test may be split into
self-contained halves and administered on successive days. In elementary grades,
available periods of time may be longer but the attention span of children
certainly is shorter. Thus, for administrative efficiency, optimum test length at
the elementary level should be no longer than at the secondary and in most
cases somewhat shorter.

It is necessary to consider pupil fatigue and interest in setting the length
of a test as well as the time available. Unfortunately, the varied nature of
pupils and of tests precludes any prescriptive statements as to fatigue and
interest. The individual teacher simply must judge how long his pupils can
maintain a good test set for a particular subject at a particular grade level.
In general we know that older children usually can sustain a good test set
for a longer period than younger children. A variety of types of test item will
permit longer sustained attention than a single type. Brighter and better
informed students can be tested for a longer period with good rapport than
can slower or less informed.

Obviously, if all pupils are supposed to finish a test, its length should be
geared to the performance of the slowest pupils. If the test is to measure
speed of performance, its length should be such that the fastest pupils cannot
complete it in the allotted time. The rate of the swiftest pupil may be measured
only if a rate exceeding his is possible for the test.

Reliability Considerations. As for the reliability consideration in test
length, select-an-answer tests achieve more reliability as they become longer
simply because the chance factor in responding becomes relatively less signifi- 
cant. Moreover, longer tests comprised of any type of item provide a larger 
sample of the dimensions being measured than do shorter tests and, hence, 
are likely to be the more reliable. An analogy may serve to explain the im- 
portance of test length for test reliability. 

Suppose that a nurse was directed to obtain a patient’s “true” blood pres- 
sure. One use of a sphygmometer (the familiar tape, bulb, and meter) would 





be subject to chance factors in the nurse’s readings and to situation variables
in the patient’s blood pressure (excitement, state of digestion, fatigue, etc.).
A second application would be subject to similar chance and situation variations,
as would a third, fourth, and so on. But an average of two readings
should be more reliable than either alone, since it is likely that the chance
and situation variables would not make for errors in the same direction each
time. An average of three could be trusted even more, and so on, until an
average of some large number of blood pressure readings could be treated as
the patient’s true blood pressure: true because, over the long run, chance
and situation variations would occur in opposite directions and in equal magnitude
and, thus, could be assumed to cancel out. Let each separate blood
pressure measurement represent one test item, the total number of blood
pressure measurements represent the total number of items in the test, and
the average blood pressure reading represent a pupil’s score on the test.
Obviously, the greater the number of items, the more reliable is the test score.

Formulas have been devised to determine the number of additional items
needed to raise the reliability of a test by a given amount. These mathematical
calculations are applicable only if the items to be added have reliability
equivalent to that of items present. It is well to know that the relationship
between length and reliability is one of diminishing returns. Double the
length of a very unreliable test and you may have a substantial increase in
reliability; but double the length of a test whose reliability already is very high
and your gain is slight.
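The diminishing returns just described can be seen numerically. The Python sketch below applies the Spearman-Brown prophecy formula in its standard form, on the formula's own assumption that added items are as reliable as those present. The function names and the sample coefficients are illustrative, not the authors'.

```python
def spearman_brown(r_present, n):
    """Predicted reliability when a test is made n times its present length."""
    return n * r_present / (1 + (n - 1) * r_present)

def length_factor_needed(r_present, r_desired):
    """How many times the present length is needed to reach r_desired."""
    return r_desired * (1 - r_present) / (r_present * (1 - r_desired))

# Doubling an unreliable test helps a good deal ...
print(round(spearman_brown(0.40, 2), 3))   # 0.571
# ... but doubling an already reliable one helps very little.
print(round(spearman_brown(0.90, 2), 3))   # 0.947

# Raising reliability from .60 to .90 requires about six times the length.
print(round(length_factor_needed(0.60, 0.90), 2))   # 6.0
```

The second function is simply the first solved for the length factor, so the two always agree with one another.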

ITEM SEQUENCE

In general, the sequence of items should be such as to engender a favorable 
test attitude, and to provide no clues to correct or desired responses. Given
types of items should be grouped together (i.e., all multiple choice, then all
true-false, etc.) both for clarity of directions and for ease of pupil response.
As we have stated earlier, no pattern of right answers should be detectable. 
As a rule, this absence of pattern may be accomplished by establishing the 
sequence among items of a type in some chance fashion. 

If several items are to relate to a subdivision of the subject, it is debatable 
whether these should be grouped together or scattered throughout the test. 
There is a greater possibility of contextual clues to correct responses if they 
are grouped together, but such grouping is logical and probably makes for 
better test rapport. 

One such formula is the Spearman-Brown prophecy formula:

$$r_{nn} = \frac{N r_{11}}{1 + (N - 1) r_{11}}$$

in which $r_{nn}$ = the desired coefficient of reliability, $r_{11}$ = the present
coefficient of reliability, and $N$ = the proportionate increase in length necessary
to obtain the desired reliability. The meaning of a coefficient of reliability is
explained on pages 184-186.





DISTRIBUTION AND RANGE OF DIFFICULTY OF ITEMS

It is important in tests of achievement that some items poor achievers
are likely to know occur among the first ten or so items of the test and that
easy and difficult items are distributed evenly throughout the test. The frustration
to be found in reading item after item after item that he can’t answer
correctly may cause a poor pupil to stop trying. An even distribution of easy
and difficult items throughout the test does not hold for so-called power tests.
These instruments are designed to be progressively more difficult and thus to
measure a pupil’s performance by the last item he can answer.

Except for power and speed tests, it is thought that the items of a test
should have an average difficulty of about 50 per cent. As many should have
greater than 50 per cent difficulty as have less than 50 per cent difficulty, or
all should have 50 per cent difficulty. The advantage in this condition is that
median or mean scores (see p. 140) then will fall about halfway between the
least probable score and the maximum possible score. With the median and
mean at such a mid-point of the possible range, pupil scores then have the
greatest freedom to assume a “natural” distribution, not to bunch up either
toward high scores or low scores simply because of the nature of the test.

It is appreciated that this proposal may be contrary to practice in many
classrooms. Tests often are prepared so that 70 per cent is to be the passing
mark, A is to be 95-100 per cent, and so on. This means that the average
item difficulty usually will be in the neighborhood of 80 per cent rather than
50 per cent. The use of an arbitrary passing mark (70 per cent or some other)
is discussed in a later chapter, page 194. There it will be seen that the practice
has questionable aspects and may be inconsistent with valid evaluation as we
conceive it. So the advice remains: compose tests of items whose mean difficulty
is about 50 per cent. Scores derived from such tests may be converted
easily into any conventional system of letter marks.

TIMED TESTS

Timed tests may be employed to measure rate of performance (how fast
a pupil can read, type, etc.) and for this employment they present no particular
problems. Since speed is being measured, the pressure of time is justified.
If frequency of error increases for certain pupils under this pressure, scores
are still valid.

In some instances, however, rate of performance is not in question and
yet timed tests are used. The reasons for such use are various: to get the
test over in a short period of time, to maintain standard test conditions for
several classes, to spur pupils to work more rapidly, and to prevent hesitant

In measuring personality attributes, many other factors than these are involved
in item sequence, but their consideration is beyond the scope of this text. See Gordon
(16) for one interesting commentary.





pupils from persisting in fruitless deliberation over how to respond. Whatever
the reason, when a stop-watch is held on pupils and they know that time will
be called whether they have finished or not, some pupils are going to respond
with anxiety or even panic. This emotional variable will, of course, decrease
the scores of certain pupils by unpredictable amounts. Hence, unless speed
of performance is being measured, it is thought that timed tests should be
used sparingly. When they are administered, special attention should be given
to rapport and a close watch kept for symptoms of distress. If it is apparent
that a pupil’s performance has suffered because of the time factor (and rate
of performance is not in question) his test score might be disregarded and his
achievement measured by another means.

DIRECTIONS

An axiom to follow in preparing test directions is that they should be clear 
and entirely meaningful to the least apt pupil who will take the test. If the
test is to be reused, they should appear on the face of the test. If the test is 
for one administration only, they may be given orally and/or written on the 
blackboard. The best standards of clear and simple exposition should prevail 
in directions and, in addition, attention should be given to the following 
specifics. 

1. Provide a place for pupils’ names and other identifying data as desired,
and orally instruct pupils to write their names on the tests.

2. Give directions for responding to different types of items at the time
each different type is encountered.

3. If the pupils are not thoroughly familiar with a given type of item,
give one or more examples of how to answer.

4. If the test is more than one page, direct the pupils to proceed to the
next page at the bottom of the page. If pupils should wait for a signal to turn
the page, place this direction at the bottom of the page.

SCORING KEY

The simplest scoring key is merely a copy of the test with right answers
marked or written in. Efficiency is gained in scoring if a strip of right answers
is used rather than the whole test page, or if a scoring stencil is employed.
These two types of key are illustrated in Figure 15. Use of a stencil requires
that responses be made by marking letters, numbers, or spaces that occupy
different positions on the page. The actual key for a machine-scored test is
prepared by the operator of the machine, but he must be furnished a machine
answer sheet with correct options marked.

GUESSING 

We have seen that items of the select-an-answer type have an appreciable
chance factor, and consequently the issue of “guessing” is raised whenever
tests are composed of true-false, multiple-choice items, and the like. There






[Figure: Example of a strip key and a stencil key]


are in existence several formulas for correction for guessing. Moreover,
some current writers on educational measurement insist that guessing should
be prohibited by directions and penalized if it occurs. Cook, summarizing
research in achievement testing and stating conclusions based on his own
research (9 147S), asserts that prohibition of guessing tends to increase the

Among the scoring formulas used to correct for guessing are the following.

For true-false items:

$$\text{Score} = \text{right} - \text{wrong}$$

$$\text{Score} = \text{total} - 2(\text{wrong})$$

$$\text{Score} = \text{total} - 2(\text{wrong}) - \text{omitted}$$

For four-choice multiple-choice items:

$$\text{Score} = \text{right} - \frac{\text{wrong}}{3}$$

$$\text{Score} = \text{total} - \text{wrong} - \frac{\text{wrong}}{3}$$

$$\text{Score} = \text{total} - \text{wrong} - \frac{\text{wrong}}{3} - \text{omitted}$$

For other numbers of options, and for matching items, substitute in the denominator
the number of options less one. These formulas assume that wrong answers are the
result of guessing. Hence, wrong responses are penalized more than omissions.






validity of a test and depresses its reliability slightly, if at all. For the prohibition
to affect validity and reliability most favorably, it is essential that the time
limit of the test approach an optimum figure.

It is necessary to agree that guessing may lessen the validity of a test. If
guessing could actually be eliminated and, in the process, no other irrational
factors were introduced, we would concur in the directions against guessing
and the corrections for it. However, it is our opinion, in which several other
writers may concur (21, 29, 33), that it cannot be eliminated and that correction
formulas have little real value. Directions not to guess and penalties for
guessing are perhaps no more important than personality attributes in determining
whether a pupil will or will not guess. Moreover, the distinction between
a reasoned judgment and a guess is often blurred, even for the pupil.
Consequently, we advise that neither don’t-guess directions nor scoring corrections
be used.
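For readers who meet a correction formula in a published test manual, its arithmetic can be sketched briefly in Python. This illustrates the conventional "rights minus a fraction of wrongs" correction only; it is not a recommendation, and the function name `corrected_score` and the sample counts are assumptions made for the example.

```python
def corrected_score(right, wrong, n_options):
    """Rights minus wrongs divided by (number of options less one).

    Omitted items are neither credited nor penalized.
    """
    if n_options < 2:
        raise ValueError("an item needs at least two options")
    return right - wrong / (n_options - 1)

# True-false (two options): 40 right, 10 wrong.
print(corrected_score(40, 10, 2))   # 30.0

# Four-choice items: 40 right, 9 wrong, any remainder omitted.
print(corrected_score(40, 9, 4))    # 37.0
```

Note that the formula treats every wrong answer as a blind guess, which is precisely the assumption questioned in the discussion above.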

When pupils are directed not to guess, it is probable that the timid and
submissive pay more attention to the directions than the aggressive and dominant.
Moreover, good but timid pupils are likely to omit a large number of answers
because they are not exactly sure, while poor but aggressive pupils are likely
to answer a disproportionately large number that should be omitted and further
depress their score. Thus, directions not to guess are likely to impose
rather severe penalties on both the good timid pupil and the rash poor one.

If, on the other hand, no directions are given one way or another about
guessing, personality differences are not so likely to be accentuated. And, if
pupils’ scores are simply the number they get correct, personality differences
are not going to have a magnified effect on the score. With or without use of
a correction formula, the chance factor in any score derived from select-an-answer
items can be exactly calculated.
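One way to carry out such a calculation is with the binomial distribution, on the assumption that a pupil who knows nothing chooses blindly and independently among equally attractive options on every item. The Python sketch below rests on that assumption; the function names and sample figures are illustrative.

```python
from math import comb

def chance_score_probability(n_items, n_options, k_right):
    """Probability of exactly k_right correct answers by blind guessing."""
    p = 1 / n_options
    return comb(n_items, k_right) * p**k_right * (1 - p)**(n_items - k_right)

def prob_at_least(n_items, n_options, k_right):
    """Probability of k_right or more correct answers by blind guessing."""
    return sum(chance_score_probability(n_items, n_options, k)
               for k in range(k_right, n_items + 1))

# Expected chance score on 40 four-choice items.
print(40 / 4)                               # 10.0

# Chance of 7 or more right on 10 true-false items: 176/1024.
print(round(prob_at_least(10, 2, 7), 4))    # 0.1719

# On 40 four-choice items, 20 or more right by chance is very unlikely.
print(prob_at_least(40, 4, 20) < 0.001)     # True
```

The comparison of the last two cases also illustrates the earlier point on test length: the longer the test and the more options per item, the smaller the part chance can play in a high score.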


ADMINISTRATION OF GUIDED RESPONSE TESTS 

Administering a measuring instrument to pupils requires care and skill no
matter what sort of device it is. Guided response tests, however, lay the greatest
claim to standardized results and, hence, are the most dependent of all
on proper administration. A test essaying to measure the right dimensions
and properly composed of items valid for these dimensions can still produce
erratic scores through errors in administration: distractions, cribbing, class
tension, and the like. Only with efficient administration can tests achieve the
reliability and validity inherent in them.

Common sense and the work-a-day experience of many teachers and
psychometrists are the source of rules and advice for test administration rather
than experimental research. However, there is essential agreement among
authorities relative to principles of good administration, and many of them are
self-evident.





Equivalent Opportunity 

A fundamental rule in group testing is that each pupil should have equivalent
opportunity to be properly measured. None should have their scores
affected negatively by administration variables nor should any be especially
favored by the manner of administration.

If any directions are given orally (and some nearly always are) children
who hear imperfectly should be seated close to the front of the room and on
the side that favors their best ear. Pupils with faulty vision uncorrected by
glasses need, of course, to be placed where they can best see the blackboard
if any directions are to be written there. If the testing room has uneven lighting,
pupils with poor eyesight should be seated where the light is best.

Sufficient Materials 

It goes without saying that there should be on hand when testing begins
sufficient test blanks, answer sheets, scratch paper, and whatever other materials
are to be used. Since pupils will break their pencil leads or will fail
to have pencils at all, a supply of sharpened pencils is a requirement. This
will eliminate the need for pencil borrowing and for excursions to the pencil
sharpener, all of which are unnecessary distractions.

Proper Facilities 

The ordinaries of instruction seem to demand good lighting, good seating,
good ventilation, etc. For the administration of a test, it is particularly essential
that facilities be the best possible. Often a single test period has greater
significance for a pupil’s progress and handling than many instructional
periods. Thirty foot-candles of diffused light on each pupil’s writing surface
is a widely accepted standard for the type of reading and writing usually involved
in tests (19). A room temperature of 68° to 72° with some circulation
of air and low humidity is likely to make pupils feel most comfortable.
Seating should be spaced so that no pupil is tempted to copy from his neighbor’s
paper. If possible, a chair or desk space should be between each pupil
and every other.

Directions 

Directions for responding to a test should, as a rule, be written on test
blanks or on the blackboard. When they are particularly complicated or when
they are addressed to elementary grade children, it is advisable to give them
orally as well. As a precaution against inattention, any oral direction should
be repeated at least once, and opportunity should be given for questions before
testing begins.

It goes without saying that the test administrator must be thoroughly 
familiar with the directions himself. For self-devised tests this is assured. With 
standardized tests and other imported instruments, it is well to read the test 




manual completely, to read the directions on the face of the test several times
and, best, to take the test yourself before you administer it to pupils.

A timed test is effective only if time limits are strictly obeyed. It is most
efficient to use a timer or a stop-watch. Lacking either of these, a watch or
clock with clearly marked minutes and a sweep-second hand may be employed.
Do not depend on small wrist watches with “artistic” dials for tests
where short intervals of time are involved. When the period involved is relatively
long, there is, of course, less need for minute and second precision.

Quiet 

Other things being equal, proper test administration requires quiet. Not
only should pupil-generated noise be eliminated, but noises from outside the
test room should be reduced to a minimum. Shouts and cries from the playground,
the bustle and babble of pupils passing in the hall, the clanging of
lockers, bells ringing, and the other usual noises of a modern public school
all serve to disturb pupils and to reduce the validity of measurement.

Humor 

A test situation, of all instructional situations, is most likely to be filled
with frowns, perplexities, frustrations, fears, and many other depressing elements.
These factors do not make for better test performance and they may
make for poorer test performance. Consequently, a teacher should feel obliged
to offset the overseriousness of many pupils and their resultant tension through
smiles, a pleasant and cordial voice, courtesy, and a relaxed manner. Humor
properly chosen and properly timed is a legitimate and prized ingredient in
the administration of a test. This humor is thought best if it is incidental and
deft. Moreover, laughter that would distract pupils from their test task should
be avoided. Obviously, any jokes or witticisms should be impersonal and
positively toned. Derogatory humor must be considered completely taboo
during the administration of a test.

Handling Upset Pupils 

Despite your best efforts, it is likely that some pupils at some times are
going to become frightened, be nauseated, get dizzy, cry, and even faint during
a test. Be alert to warnings of distress and act to relieve the pupil before his
distress becomes acute. Flushing or loss of color in the face, hand wringing,
heavy breathing, taut muscles in the neck are some of the symptoms of test
malaise.

A number of simple devices are available to reduce the tension making
the pupil ill. Casual assurance, either by word or gesture, that he is doing all
right is all that some pupils need. It may be appropriate to ask the pupil how
he is getting along and to re-explain some direction. Other pupils may benefit
from your identification with their distress (“If I had to take this test, I’d be





excited too,” or “When I was in school I used to turn green too but I always
did better than I expected.”).

In cases judged to be extreme, it may be advisable to tell the pupil that if
he does poorly you will give him another chance when he feels better. It
never is advisable to tell a pupil that he isn’t frightened, that he shouldn’t be
upset, or otherwise to belittle his plight or to antagonize him. Test anxiety is
bad enough without the additional irritation of a teacher’s intolerance.

Should preventive measures fail and a pupil’s distress be extreme, he
should be allowed to quit the test and possibly to leave the room. In doing
this, assurance of understanding and sympathy should be given. The pupil
should be told that another opportunity to take the test will be given him or
that an alternate measuring procedure will be used. To quit a test because
of emotional distress necessarily is an ego-deflating event. If, in addition, a
pupil feels that he will fail or miss some important academic target because
of his distress, he is doubly injured.


STANDARDIZED TESTS 

For perhaps fifty years now, American teachers have been able to use
educational measuring instruments devised by others and claiming some degree
of scientific validity. Binet and Simon issued their first intelligence scale
in 1905. Earlier, Rice had used standard spelling tests in an experiment, and
slightly later, after Thorndike had published a book on mental measurement,
came such pioneering tests as the Courtis Arithmetic Computation Test
(1909), the Ayres Handwriting Scale (1911), and the Hillegas Composition
Scale (1912). The Army Alpha and Beta tests of intelligence were applied
to Armed Forces draftees in 1917 and 1918, in a colossal program of mass
testing, the findings and techniques of which served as data and points of
controversy for years to come. The 1920's were, perhaps, the “golden years”
of the standardized test. Score upon score of new titles appeared on the market;
the use of standardized tests seemed to many to be a panacea for all
educational ills; and, regrettably, too many tests were devised, used, and their
results applied without due regard to the tenets of valid measurement.

Now, in the 1950's, users of standardized tests have a far more critical
attitude toward the tests and, because of the inclusion of principles and techniques
of measurement in teacher training curriculums, the tests are used with
far greater understanding and skill. Moreover, research in test construction
has provided test designers with construction and validation techniques unknown
to their predecessors.

At the middle of the century, the publication of educational and psychological
tests continues to increase. Oscar Buros' Fourth Mental Measurements
Yearbook, 1953 (6) lists 793 titles, whereas the previous Yearbook,
published three years earlier, listed only 663 tests. In this most complete
and authoritative of all test bibliographies, the tests of 152 different publishers
are reviewed. Nearly an equal number of test publishers are listed, none of
whose tests happen to be described. Of the publishers, eight major firms*
seem to produce the majority of tests that enjoy widespread school use. A
great many noncommercial agencies print and distribute standardized tests:
college and university presses, the armed forces, state departments of education,
professional organizations, and many large school districts.

Characteristics of Standardized Tests 

Just the printing of a test and its general distribution does not make it a
standardized test. To deserve to be called “standardized,” a test must meet
three critical conditions. It must have been preadministered to a population
with known characteristics. It must have been revised or at least viewed
critically in the light of the results of this preadministration. From the
preadministration, a table of raw scores matched with correlative derived scores
must be prepared and available to any user of the test. These derived scores,
called norms, usually are percentile ranks or standard scores. In the case of
some tests of personality attributes they may be classifications of interest,
attitude, or neurotic tendency. (See Chapter 7, pages 155-160, for a full
discussion of derived scores.)

“Standardized” in test construction means about what it does in factories
and laboratories. The instrument has been prepared under known and
controlled conditions, the results of its use are predictable, and the significance
of any measure it yields is known in advance.

Obviously, then, many of the tests listed in Buros, in other bibliographies,
and in publishers' catalogues are not standardized tests. Almost invariably, a
reputable standardized test has an accompanying manual to explain the test
and to present its norms. A simple and preliminary way to determine whether
a test actually has been standardized or not is to note the presence or absence
of a manual.

Sources of Standardized Tests 

The volume just cited, Mental Measurements Yearbook, is the largest
and most critical comprehensive bibliography of standardized tests. All the
large test publishers issue catalogues, as do many of the minor ones. Periodicals
such as Educational and Psychological Measurement and the Journal of
Consulting Psychology publish notices and reviews of new tests.

Specimen sets of nearly all tests are available at nominal cost. In some
cases, the endorsement of a principal or other school official may be required
before the specimen will be transmitted. It is advisable, if not mandatory, to
examine a specimen before ordering a quantity.

* Bureau of Educational Research, State University of Iowa, Iowa City; Educational
Test Bureau, Educational Publishers, Inc., Minneapolis, Minn.; Educational Testing
Service, Princeton, N.J.; Houghton Mifflin Co., Boston, Mass.; Psychological Corporation,
New York; Public School Publishing Co., Bloomington, Ill.; Science Research
Associates, Chicago, Ill.; World Book Co., Yonkers, N.Y.

The expense of standardized tests varies greatly, from the rather high cost
of many clinical instruments ($16 for the Binet, $60 for the Arthur) to the
15 or 20 cents a copy for some group achievement tests. The chief expense
of standardized tests is in the test booklets and manuals. Separate answer
sheets may cost only 2 to 6 cents a copy.

Merits and Shortcomings of Standardized Tests 

The chief value of a standardized test over a nonstandardized one lies in
the added care with which it usually has been prepared, in its empirically
tested reliability and validity, and in the fact that it may yield scores of general
rather than purely local significance. On the debit side, a standardized test
may not be addressed precisely to the dimensions you would like to measure.
Its norms may be inappropriate to your pupils. And knowledge that a standardized
test covering given content is to be used may have a restrictive effect
on a course of study.

Now, of course, if a teacher or an official or committee in a school district
can design and standardize its own test, it may be possible to achieve all the
advantages of the standardized test and yet incur none of its liabilities. The
standardization of a measuring instrument is a tedious and time-consuming
occupation but it is an entirely possible and practical undertaking for a school
or a teacher. The procedures to use are relatively simple and are fully described
in many texts (2, 24, 37). Those relating to item construction and
revision and to determination of reliability are presented in this volume.
While more knowledge of statistics is required than might be gained from a
course in measurement or even an introductory course in statistics, test standardization
is not a task that requires a professional statistician.

In only one area is the use of a standardized test mandatory, this being
intelligence. A measurement of intelligence is hardly worthwhile unless general
significance can be claimed for it. Moreover, to devise a reasonably valid
test of intelligence, let alone standardize it, requires training and experience
beyond that possessed by the usual teacher, administrator, or even psychometrist.
In assessments of occupational interest, of specialized aptitude (music,
mechanics, etc.), and of personality patterns, it is usually advisable to employ
standardized tests rather than self-devised ones. However, the available
standardized tests of these phenomena are likely to be far less precise than
those available for intelligence.

Judging Standardized Tests 

The worth of a standardized test may be judged on the same basis as any
other measuring procedure: reliability, validity, and efficiency (see pages
40-44). These may be assessed prior to first use only by reading what the
publisher has to say about his test in catalogue and manual and what the
reviewers have to say. As a rule, completeness of information in a manual is
in itself an evidence of worth. Absence of information may be construed to
mean that a given factor has not been assessed (which is bad) or that an
adverse finding has been withheld (which is worse).

INFORMATION OF IMPORTANCE

Consequently, as a first criterion, detailed and precise information about
reliability, validity, and efficiency should be available for the standardized
test in question. Bearing on these are the nature and size of the standardization
population, the techniques employed in item construction and revision,
and the dimensions that the test purports to measure; so these too should be
fully described.

RELIABILITY

The reliability of a standardized test may be stated mathematically as a
coefficient of reliability (see page 185), but the degree of reliability necessary
depends upon the use of the test and the reliability of like tests. For appraisals
of groups of pupils and decisions about groups, lower reliabilities are acceptable.
For measurements of individuals and important decisions affecting
them, higher reliabilities are desired. Many intelligence and achievement tests
have coefficients of reliability of .90 to .95, but the reported reliabilities of
personality tests tend to run in the .80's.

VALIDITY

Validity sometimes is stated mathematically, as a derivative from a coefficient
of reliability and/or as a coefficient of correlation between the test
and some criterion. If a suitable criterion exists, a validity coefficient relative
to it has some meaning, but oftentimes the criterion has no more pretense to
validity than the test with which it is compared. For intelligence tests, comparisons
between the given test and a battery of other tests and/or the
Wechsler or Binet are important for validity. Achievement tests should be
expected to correlate to a high degree with a composite of many other reputable
achievement tests covering the same subjects. Moreover, they should
show a high positive correlation with school marks in the subjects in question.
In the writers’ view, validity coefficients derived from reliability coefficients
still are reliability coefficients and deserve no separate attention.

Very often, no mathematical expression can be given to a test’s validity,
and the case for it rests upon a discussion of its construction and the data
resulting from its administration. In weighing this discussion, attention should
be given to any analysis of item discrimination and difficulty, to the relationship
between test scores and age and school grade, and to the shape of raw
score distributions (see pages 98-101, 186). Even if a coefficient of validity is
presented, a description of these points is necessary.





A test’s validity is not just an inherent factor; it relates necessarily to the 
use to which the test is to be put, and to the similarity between the population 
used for standardization and the pupils to whom the test now is to be applied. 
Obviously, the test will have validity for your purpose somewhat to the degree 
that there is agreement between the dimensions that the test purports to 
measure and the dimensions you wish to measure. And the raw scores of 
your pupils may be converted into valid derived scores using the test's norms 
only to the degree that your pupils possess the same essential characteristics as 
the standardization population upon which the norms are based. Conversion
of a raw score on a standardized test into a percentile or a standard score 
from the test’s table of norms simply gives the pupil a position in the stand- 
ardization group. This position is meaningful only when the pupil is within 
the same age range, has had about the same experience, possesses much the 
same motivation, etc., as the pupils upon whom the norms were based. If 
these similarities are missing, we may not be sure what caused a pupil's score
and we may not, with safety, go beyond the simple fact that he made a relatively
high, low, or intermediate raw score on the test.

The efficiency of a standardized test involves its administration time, the
ease with which it may be administered, how it may be scored, and its cost.
Such factors usually are described in manuals and even in catalogue entries.
To judge the efficiency of a given test for your use requires merely that these
factors be weighed against the available funds, time, and facilities.


Some Special Considerations in Using Standardized Tests 

Many tests that cover several dimensions or subject aspects (such as
spatial perception, reasoning, etc., in intelligence; vocabulary, rate, and
comprehension in reading) provide for the recording of part scores and for their
representation on a graph. These score graphs are called profiles and one is
illustrated in Figure 16. Profiles permit easy visualization of the variations in
a pupil’s achievement, intellectual facilities, etc., and they facilitate
the diagnostic use of measurements.

Figure 16. A test profile form.

As all elementary teachers know, and as all of us should remember, there
are standardized tests that measure achievement in several school subjects.
Buros lists twenty-six such “Achievement Batteries.” Nearly all of them cover
reading, vocabulary, and arithmetic. To these, some add science, social studies,
English usage, and spelling. As a rule, an elementary achievement battery is
produced in several forms for use at different grade levels. If achievement
batteries are to be used for measurement at several points in the pupil’s progress
through school, it is advisable that appropriate forms of the same basic test
be used rather than different tests. This permits more consistent charting of
progress and affords more valid comparisons between scores made at different
grade levels.

Summary 

“Guided response procedures” is used as a categorical term for true-false,
multiple-choice tests and the like. The essence of such procedures is that
testees have only a limited number of options for response to any item and
the significance of each option is predetermined. There seem to be three basic
types of guided response item: select-an-answer (true-false, matching,
multiple-choice, etc.), provide-an-answer (fill-in, short answer, labeling, etc.),
and arrangement of elements (ordering, assembly, etc.). The procedures
make use of all the essential devices of behavioral measurement. The questions
themselves are standard stimuli. Limited options of response with predetermined
significance constitute standard responses. The items may be
scored with a key and thus they embody standard analysis. Sampling, of
course, occurs because items in a test are less numerous than the components
of achievement, personality, etc., toward which they are directed.

The procedures yield test scores that may be considered as classificatory
numbers and usually can be converted into numbers indicative of rank. Scale
symbols are possible under certain special circumstances.

Construction of valid and reliable guided response tests necessitates adherence
to some such directions as these:

1. Define exactly the phenomenon to be measured and its measurable
dimensions in behavioral terms.

2. Devise verbal and/or graphic items, the responses to which automatically
will classify pupils with respect to given dimensions. The classification
usually is dichotomous and in achievement tests the two categories usually
are “knows it” and “does not know it.” To accomplish this classificatory function
accurately, guided response items should be short, affirmative, simple,
and unequivocal; couched in pupil language; not lifted from textbooks. Collectively,
the items should discriminate among all types of individuals to be
tested; be mutually independent; sample adequately all dimensions of the
phenomenon being measured that are subject to sampling.





3. Select items for use according to their properties and construct
them according to their type specifications. The types vary as to chance of
guessing, as to time of administration, and as to difficulty.

4. Assemble items and affix directions according to the intended use of
the instrument and certain principles of test format, length, item sequence,
timing, directions, scoring key, and the operation of chance.

The administration of guided response tests requires care and skill if the 
condition of standard stimulation and standard response is to be attained. 
Each pupil or other testee should have equal opportunity to be measured
properly. Sufficient materials should be at hand. Lighting, seating, ventilation, 
etc., should be optimum. Directions usually should be given orally as well as 
in writing. A quiet and relaxed atmosphere should prevail. 

“Standardized test” is the collective name for guided response instruments
that have been validated through trial and for which norms have been developed.
The norms are the scores achieved by a population having known
characteristics or statistics derived from such scores, such as age means, grade
means, percentile equivalents, etc. The chief advantages of standardized tests
over locally devised ones lie in their technical excellence, their “national”
norms, and their usually high reliability. On the weakness side is their too
frequent inappropriateness to the objectives of a school, the organization of
its curriculum, or the character of its pupils.


EXERCISES

1. For a subject in which you specialize, prepare several test items of each
of the following types:

True-false
Multiple-choice
Matching
Completion or fill-in
Short answer
Labeling
Ordering or assembly of elements

2. Prepare one or more guided response questions that will measure reasoning
about facts and not just memory of facts.

3. Devise a set of guided response items that combines two or more of the
basic types listed in Exercise 1.

4. Study the manual of a standardized test, administer it to a group of pupils,
and score the papers.

5. Define some aspect of pupil achievement you would like to measure, state
its dimensions in behavioral terms, and construct a guided response test that
will measure pupils relative to these dimensions. Indicate the dimension to which
each test item is keyed.



CHAPTER 7 


STATISTICAL DESCRIPTIONS OF 
MEASUREMENT DATA 


We have seen in previous chapters that measurement entails the assignment
of numbers or other limited symbols to the dimensions of phenomena,
and we have discussed the many procedures (observation, testing, etc.) by
which this assignment may be accomplished for behavioral phenomena. Now
let us examine certain mathematical operations that may be necessary before
our assigned symbols are fully meaningful.

To begin, let us think of a teacher who has just rated a group of pupils
as to the effectiveness of their study habits and at the same time has just
scored a multiple-choice geography test completed by the same pupils. He
has before him the usual first outcomes of behavioral measurement, an unsystematic
array of symbols and numbers or, as we describe them technically,
raw data. So far as the ratings of study habits are concerned, he may, if he
wishes, leave the data raw. In the case of the geography test, however, the
“raw” data need to be arranged and treated mathematically before he records
them. The difference lies in the fact that the ratings in themselves describe
the status of the pupils and thus are measures, whereas the test scores do not
in themselves describe anything. They are just numbers. Before a measure is
accomplished, before a pupil's classification or rank or scale position is expressed,
something further must be done with them.

Just what is to be done with these raw test scores is determined by the
forms of measurement we seek, by the phenomena we measure, and by the
measuring procedures we have used. But in general, we manipulate the numbers
so as to describe the group in terms of the individuals comprising it and
the individuals in terms of the group to which they belong. We shall restrict
ourselves in this chapter to definitions and explanations of basic statistical
terms and processes that may be used to describe groups and individuals.

Tabular and Graphical Portrayal of Group Measurements 

The simplest step for organizing numerical data in tabular or graphical
form is to arrange the numbers in decreasing or increasing order of size. This
is commonly called a distribution of scores. Typically, the highest score is
written first, followed by the next highest, and so on until the lowest score is
reached. If a certain score occurs more than once in the data, then it is repeated
the necessary number of times in its proper position in the sequence.
The following is an example of such an arrangement.


    55    43    39    21
    53    41    36    19
    52    40    35    19
    49    40    31    17
    49    40    31    16
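The same arrangement can be produced mechanically. The following is a minimal Python sketch, not part of the original text: the names `raw_scores` and `distribution` are illustrative, and the unordered list is simply the twenty scores above written in an arbitrary order.

```python
# The twenty scores shown above, written in an arbitrary (unordered) sequence.
raw_scores = [40, 19, 55, 31, 49, 36, 16, 43, 52, 40,
              21, 35, 49, 17, 39, 31, 53, 19, 41, 40]

# A distribution of scores: highest score first, with a repeated score
# appearing as many times as it occurs in the data.
distribution = sorted(raw_scores, reverse=True)

print(distribution)
```

Sorting in increasing order instead would only require dropping `reverse=True`.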


FREQUENCY TABLES

Ordinarily, for relatively small groups, such a simple arrangement would
be all that is necessary for descriptive purposes or for any further calculations.
For larger groups of data, say one hundred scores or more, such an arrangement
would be very little improvement in way of further organization. The
data would still be largely unmanageable for further calculations and whatever
group pattern might exist in the data would still be unidentifiable. Distributions
containing a large number of scores need a more compact tabular
arrangement. This is provided for by the device of grouping the scores into
intervals and then indicating the number or frequency of scores in each interval.
This is called a frequency table.

For our purposes, we shall define an interval as any indicated grouping of
numerical scores. An interval, then, may contain only one score or may contain
several scores. Conventionally, if an interval contains only one score, say
76, then the interval extends from 75½ to 76½. Graphically, this interval
would be indicated as follows:

    75½        76        76½

When several scores are included in an interval, the interval is designated by
a lower and an upper boundary score. For example, the interval 27-32 includes
the following six scores: 27, 28, 29, 30, 31, 32; and graphically this
interval would look like this:

    26½   27   28   29   30   31   32   32½

Note that the interval extends a half unit below (26½) and a half unit above
(32½) the designated boundaries. These are called the actual boundaries of
an interval.

The interval size is the number of scores contained in the interval or the
difference between the actual boundaries of the interval, both being numerically
the same. Thus, the size of the interval 27-32 is 6, because it contains 6
scores and because the difference between its actual boundaries, 26½ and
32½, is 6. The mid-point of an interval is found by taking half of the interval
size and adding it to the lower actual boundary or subtracting it from the
upper actual boundary. For example, the mid-point of the interval 27-32 is
found by taking half of the interval size (½ of 6 = 3) and adding it to the
lower actual boundary (26½ + 3 = 29½) or subtracting it from the upper
actual boundary (32½ - 3 = 29½). In both cases the mid-point is 29½.

As a summary, here are some further examples that will illustrate the
different aspects of intervals just discussed.


Designated   Actual Boundary             Consecutive
Interval     Limits             Size     Scores Included           Mid-point

32-37        31½ to 37½          6       32, 33, 34, 35, 36, 37      34½
52-53        51½ to 53½          2       52, 53                      52½
68-71        67½ to 71½          4       68, 69, 70, 71              69½
40-44        39½ to 44½          5       40, 41, 42, 43, 44          42
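The rules for actual boundaries, interval size, and mid-point can be checked with a few lines of code. What follows is only a sketch under the definitions given above; the function name `describe_interval` and the dictionary keys are illustrative and do not come from the text.

```python
def describe_interval(low, high):
    """Describe the designated interval low-high (whole-number scores)."""
    lower_actual = low - 0.5               # half a unit below the lower boundary
    upper_actual = high + 0.5              # half a unit above the upper boundary
    size = upper_actual - lower_actual     # same as the count of scores included
    scores = list(range(low, high + 1))    # consecutive scores in the interval
    midpoint = lower_actual + size / 2     # half the size added to the lower actual boundary
    return {
        "actual_boundaries": (lower_actual, upper_actual),
        "size": size,
        "scores": scores,
        "midpoint": midpoint,
    }

for low, high in [(27, 32), (32, 37), (52, 53), (68, 71), (40, 44)]:
    print(f"{low}-{high}:", describe_interval(low, high))
```

Note that an interval of odd size, such as 40-44, has a whole-number mid-point, while an interval of even size has a mid-point ending in one half.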


A sequence of intervals of the same size provides a convenient device for
tabulating the raw scores. For instance, if the raw scores ranged from 50 to
90, then the following sequence of intervals could be used:

    90-94
    85-89
    80-84
    75-79
    70-74
    65-69
    60-64
    55-59
    50-54

Note that the sequence of intervals is arranged so that there is no overlapping
of scores and hence no ambiguity as to the interval to which a score belongs.

The number of intervals used in a sequence to cover a given range of
scores may vary according to the purpose one has in mind for grouping the
scores, and to experience and judgment. Most frequently scores are grouped
to determine the over-all pattern in the distribution. If this is the case, the
following is a suggested guide for determining the number of intervals to use
in a sequence.

Number of Scores      Suggested Number of
in Distribution       Intervals to Use

       30                  6 to 8
       50                  7 to 9
      100                  8 to 10
      300                  9 to 12
      500                 10 to 14
    1,000                 12 to 17




Once the number of intervals has been determined, it is a simple matter
to compute the size of the intervals. First find the range of the scores by subtracting
the lowest score from the highest. Then divide the range by the number
of intervals to be used and the result is the approximate size of the interval.
After the sequence of intervals has been set up, the scores are tallied in
the intervals and the result is a frequency table.
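The procedure just described may be sketched in Python as follows. This is an illustration only: the function names, the choice of eight intervals, and the starting score of 15 are assumptions made for the example, and the data are the twenty scores of the small distribution shown earlier in the chapter rather than any data set from the text's tables.

```python
from collections import Counter

def approximate_interval_size(scores, number_of_intervals):
    """Range (highest minus lowest) divided by the number of intervals, rounded."""
    score_range = max(scores) - min(scores)
    return max(1, round(score_range / number_of_intervals))

def frequency_table(scores, lowest_boundary, size):
    """Tally scores into equal intervals; returns (low, high, frequency), top interval first."""
    if min(scores) < lowest_boundary:
        raise ValueError("lowest_boundary must not exceed the lowest score")
    # Each score falls in exactly one interval, so there is no overlapping.
    counts = Counter((score - lowest_boundary) // size for score in scores)
    rows = []
    for k in range(max(counts), -1, -1):
        low = lowest_boundary + k * size
        rows.append((low, low + size - 1, counts.get(k, 0)))
    return rows

scores = [55, 53, 52, 49, 49, 43, 41, 40, 40, 40,
          39, 36, 35, 31, 31, 21, 19, 19, 17, 16]

size = approximate_interval_size(scores, 8)   # range 39 divided by 8, about 5
table = frequency_table(scores, 15, size)     # start the sequence at 15 (an arbitrary choice)

for low, high, frequency in table:
    print(f"{low}-{high}: {frequency}")
```

Starting the lowest interval at a multiple of the interval size, as here, is a convenience and not a requirement; any starting point at or below the lowest score gives a valid sequence.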

To review the steps just described, let us begin with some “raw data” and
outline the things necessary to set up a frequency table that will best describe
the distribution of the scores.


TABLE 3

An Outline for Constructing a Frequency Table

Raw data

    119  112   98  ...  100  ...  112  108  105  105
     94  106  100  109   97  ...  102  107   90  121
    111  101  ...  102   96   95  110  ...  107  101
    127  104  ...  104  117  106  ...   91  ...  104
    118   97  116  101  111  ...  124  109   98  111
     92  110  ...  ...  107  ...   92  106  100  126
    109  ...  118  111  102   97  104   95  122  119
    115  ...  128  104  ...  121  106  ...  110   96
     95  107  109  ...  120  ...  107  114  108  116

Steps to follow

1. Note the number of scores in the raw data. In this example there are 90 scores.

2. Determine the desirable number of intervals to use. Here, say 9 or 10 intervals.

3. Determine the range by subtracting the lowest score from the highest score.

4. Determine the size of the intervals by dividing the range by the number of
intervals desired. Here the result is approximately 5.

5. Set up the sequence of intervals.

6. Tally the scores in the proper intervals.

Frequency table

    Interval    Tally                          Frequency

    130-134     |                                   1
    125-129     |||                                 3
    120-124     ||||| |                             6
    115-119     ||||| |||||                        10
    110-114     ||||| ||||| ||||                   14
    105-109     ||||| ||||| ||||| ||||| |          21
    100-104     ||||| ||||| ||||| ||||             19
     95-99      ||||| ||||| |                      11
     90-94      |||||                               5

                                       Total       90


Now let us take a look at two other frequency tables of the same raw
data. In Table A, the size of the interval is smaller and hence the number of
intervals to cover the range is greater. In Table B, the size of the intervals
is larger and the number of intervals required is fewer.





          (A)                              (B)

    Interval    Frequency          Interval    Frequency

    132-133         1              130-139         1
    130-131         0              120-129         9
    128-129         1              110-119        24
    126-127         2              100-109        40
    124-125         1               90-99         16
    122-123         2
    120-121         3              Total          90
    118-119         4
    116-117         3
    114-115         4
    112-113         6
    110-111         7
    108-109         8
    106-107         9
    104-105         9
    102-103         7
    100-101         7
     98-99          5
     96-97          5
     94-95          4
     92-93          1
     90-91          1

    Total          90


Note that in Table A the distribution is so spread out that it is rather
difficult to determine any possible group pattern. Likewise in Table B, the
distribution is so bunched up that it is still difficult to determine any group
pattern. Thus we see that different numbers of intervals affect the pattern of
the distribution. Somewhere between these two extreme examples lies the
optimum number of intervals that will best exhibit whatever group pattern
may exist in the distribution of scores. The original frequency table exhibited
in Table 3 does show a group pattern or shape in the distribution. This frequency
table indicates fewer scores at the upper end of the scale with most
of the scores grouped at the lower end of the scale.
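The bunching effect of larger intervals can be reproduced by combining adjacent intervals of an existing frequency table. The sketch below is illustrative only: `TABLE_3` holds the interval frequencies of Table 3 as read here, and `merge_intervals` is a hypothetical helper, not a procedure given in the text. Merging the size-5 intervals in pairs yields the size-10 intervals of Table B.

```python
# Frequency table of Table 3 as (lower score, upper score, frequency).
TABLE_3 = [
    (130, 134, 1), (125, 129, 3), (120, 124, 6),
    (115, 119, 10), (110, 114, 14), (105, 109, 21),
    (100, 104, 19), (95, 99, 11), (90, 94, 5),
]

def merge_intervals(table, factor):
    """Combine every `factor` adjacent equal-size intervals, starting from the lowest."""
    ascending = sorted(table)
    width = ascending[0][1] - ascending[0][0] + 1    # size of the original intervals
    merged = []
    for i in range(0, len(ascending), factor):
        group = ascending[i:i + factor]
        low = group[0][0]
        high = low + factor * width - 1              # new upper designated boundary
        merged.append((low, high, sum(f for _, _, f in group)))
    return merged[::-1]                              # highest interval first

for low, high, frequency in merge_intervals(TABLE_3, 2):
    print(f"{low}-{high}: {frequency}")
```

The reverse operation is not possible: once scores are grouped into wide intervals, the finer pattern cannot be recovered from the table alone, which is why the raw scores are needed to build Table A.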

With just the frequency table, it is difficult to visualize the distribution of
the scores. Therefore, we turn to graphic methods of portraying the data.
There are many different ways of graphically portraying a frequency distribution.
However, certain of these ways are particularly straightforward and
effective, and will be especially useful in the later developments of further
statistical measures. Among these graphical representations we shall discuss
the histogram, the frequency polygon, and the smoothed frequency curve,
all of which are based on the frequency table.





HISTOGRAMS 

The histogram is essentially a vertical bar graph and is particularly adaptable
to the intervals of a frequency table. The first step in constructing a
histogram from a frequency table is to set up a horizontal axis and a vertical
axis. Then on the horizontal axis, lay off the boundary marks of the intervals
and on the vertical axis, lay off a scale of frequency per interval. The final
step consists of erecting a vertical bar at each interval representing the frequency
for that interval. An illustration will perhaps serve to clarify the steps
that are involved in setting up a histogram. This is presented in outline form
in Table 4.


TABLE 4

An Outline for Constructing a Histogram

Steps to follow

1. Start with the frequency distribution you wish to represent graphically
as a histogram.

    Interval    Frequency

    130-134         1
    125-129         3
    120-124         6
    115-119        10
    110-114        14
    105-109        21
    100-104        19
     95-99         11
     90-94          5

    Total          90

2. Set up a vertical axis and a horizontal axis.

    [Figure: empty axes, labeled Vertical Axis and Horizontal Axis]

3. Lay off boundary marks of intervals on horizontal axis and frequency
per interval on vertical axis.







Looking at the finished histogram we can readily see that it gives an
effective visual presentation of group pattern, and is simple to construct.
There are some important features to note concerning the histogram. First
of all, the boundary marks on the horizontal axis are only approximate. For
instance, the boundary mark 95 on the histogram should technically be 94½.
The actual boundary marks have all been rounded off to the next highest
whole number, and this is done solely for convenience of representation. It is
also important to note that the scale on the vertical axis represents frequency
per indicated interval rather than just frequency. This necessitates the intervals
along the horizontal axis being of equal size. The frequency for each interval
is actually represented by the area of the vertical bar erected at that interval.
As an example, for the first interval of the histogram its frequency of 5 is
represented by the shaded area of the vertical bar shown with the following
portion of the histogram.


[Figure: the first bar of the histogram, shaded, reaching 5 on the vertical scale]



It would follow, then, that the sum of the areas of the vertical bars of the
histogram represents the total frequency of 90 scores.
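A rough histogram can be printed from the frequency distribution of Table 4 with a short program. The sketch below is an illustration and not the authors' procedure: the bars are drawn sideways (one row of marks per interval) because that is simpler in plain text, and the names `FREQUENCY_TABLE` and `text_histogram` are assumptions of the example.

```python
# Frequency distribution of Table 4 as (lower score, upper score, frequency).
FREQUENCY_TABLE = [
    (130, 134, 1), (125, 129, 3), (120, 124, 6),
    (115, 119, 10), (110, 114, 14), (105, 109, 21),
    (100, 104, 19), (95, 99, 11), (90, 94, 5),
]

def text_histogram(table):
    """Return one line per interval; each '#' mark stands for one score."""
    lines = []
    for low, high, frequency in table:
        lines.append(f"{low:>3}-{high:<3} | {'#' * frequency} {frequency}")
    return lines

bars = text_histogram(FREQUENCY_TABLE)
print("\n".join(bars))

# Taking one interval as the unit of width, the area of each bar equals its
# frequency, so the bar areas add up to the total number of scores.
total_area = sum(frequency for _, _, frequency in FREQUENCY_TABLE)
print("Total:", total_area)
```

Because every interval has the same size, the length of each row of marks is proportional to the area of the corresponding bar, which is the point made in the text about equal intervals.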


FREQUENCY POLYGONS

The frequency polygon is formed by simply connecting the mid-points of
the tops of the bars of the histogram with straight lines, as shown in Figure 17.






Figure 17. Frequency polygon (horizontal axis: score intervals).

By drawing the frequency polygon through the mid-points of the vertical
bars of the histogram, the area under the polygon is equivalent to the area of
the histogram. This is shown by the following figure, in which a straight line
connects the mid-points of two adjacent vertical bars.

[Figure: two adjacent bars joined by a straight line, marking triangles A and B]

Triangle A cut off from the right bar is equal to triangle B added to the
left bar of the histogram. This occurs throughout the histogram, leaving the
area unchanged.

Since the frequency polygon is made up of straight lines connecting
successive points, the result generally is a jagged effect. This can be offset
somewhat by an increase in the number of scores. However, if this is not
possible, the jagged effect of the frequency polygon may be reduced by simply
sketching a smooth curve that approximates the general shape of the frequency
polygon. The result is called a smoothed frequency curve and is illustrated in
Figure 18. The smoothed curve is more pleasing to the eye and yet is very
effective in portraying the over-all pattern of a distribution of scores. Once
again the area under the curve represents the total number of scores in the
distribution, and the scale on the vertical axis represents the frequency for the
intervals specified in the original frequency table.

Figure 18. Smoothed frequency curve.

The histogram, frequency polygon, and smoothed frequency curve are 
some of the many ways in which group measures may be graphically presented 
and are the ones most commonly encountered in educational measurement. 
At this point it might be well to exhibit and identify some of the possible 
patterns of group distributions.

Figure 19. Bell-shaped or normal distribution.
Figure 20. Symmetrical distribution.
Figure 21. J-shaped or Poisson distribution.
Figure 23. Rectangular distribution.
Figure 24. U-shaped distribution.
Figure 25. Bi-modal distribution.


The first (Figure 19) is a theoretical curve and will be discussed in more
detail later (pages 163-169). It is essentially a curve of normal probability.
The remaining figures are presented as some possible types of distributions
of group measures and are identified by name.



We stated at the beginning of this chapter that the purpose of statistical
presentations was to make unmeaningful measurement data meaningful, or
to give additional meaning to already meaningful data. It is well to ask, then,
what do these frequency tables, these histograms, and these frequency
polygons tell us about the pupils who made the scores that comprise the tables
and graphs? What precise appraisal of status do we have that we did not have
before we made the tables or constructed the graphs?

Well, from the frequency table and from the graphs we may observe which
scores were lowest and which were highest. At a glance we may judge the
total range of scores and the range of scores that constitute the central cluster,
or we may observe that there is no central cluster. Simply by looking we may
see that the majority of the group scores are high, low, or in the middle.
Finally, knowing John's or Harry's score, we can estimate his rank in the
group and we can express some qualifications about that rank: He is at the
top, but it is a close distribution. He is about at the middle, but the middle is
skewed toward the high end of the scores.

Describing a Group by a Representative Score 

We turn now from describing groups by tables and graphs to describing
groups by representative scores. The use of a representative measure is a
common everyday occurrence. The average golf score of a person, say 76, is
representative of that person's golfing performance. If a certain city publicizes
that its mean daily temperature is 83° F., this likewise represents that city's
climate. And one way for a teacher to describe the performance of her class
on an arithmetic achievement test is to state a single score that best indicates
the over-all performance of the group. Because it is to represent, we want
that score, if possible, to be the one about which all the rest of the scores tend
to cluster. For this reason statisticians call these representative scores
measures of central tendency. In our discussion of representative scores, or
measures of central tendency, we shall consider the mode, the median, and the
mean, which is often called the "average."

MODE

One way of selecting a representative score from a table or distribution
of scores is to choose that score which occurs most frequently. When this is
done, the resulting score is called the mode, just as the style of clothing most
frequently seen is called the mode. Among the raw scores presented in Table
3, the score 105 occurs the most frequently and is therefore the mode of the
distribution. When the raw scores are grouped by intervals in a table,
the mode is arbitrarily defined as the mid-point of the interval in which the
largest frequency occurs. Again referring to Table 3, and this time to the
frequency table developed there, we find that the interval 105-109 has the
largest frequency and therefore its mid-point, 107, is the mode. The
discrepancy between the mode of 105 determined from the raw ungrouped scores
and the mode of 107 determined from the grouped scores can be explained
by the fact that, in frequency tables, the raw scores lose their individual
identity and the mid-points of intervals are considered representative of all the
scores in the intervals. As we progress, we shall note further discrepancies
between computations based on ungrouped data and computations based on
grouped data.

The mode as defined here is certainly simple and easy to determine by
inspection. However, at best the mode is only a crude approximation of a
representative score around which the rest of the scores tend to cluster.
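
The two definitions of the mode can be tried out with a short Python sketch. The scores below are hypothetical (they are not the data of Table 3), and the function names are our own; the sketch is meant only to show how the raw-score mode and the grouped mode can differ.

```python
from collections import Counter


def mode_ungrouped(scores):
    """Return the score that occurs most often (first one met if tied)."""
    return Counter(scores).most_common(1)[0][0]


def mode_grouped(intervals):
    """intervals: (low, high, frequency) rows.

    Return the mid-point of the interval with the largest frequency.
    """
    low, high, _ = max(intervals, key=lambda row: row[2])
    return (low + high) / 2


# Hypothetical raw scores and the same scores grouped in intervals of 5.
scores = [105, 105, 105, 108, 108, 109, 101, 103, 112]
grouped = [(100, 104, 2), (105, 109, 6), (110, 114, 1)]

raw_mode = mode_ungrouped(scores)
grouped_mode = mode_grouped(grouped)
print(raw_mode, grouped_mode)
```

With these made-up scores the raw mode is 105 while the grouped mode is the mid-point 107, the same kind of discrepancy noted above.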


MEDIAN 

A second type of score may be used to represent a distribution of
scores: that score which divides the distribution in half when the scores are
arranged according to size. A score which does this is called the median of
the distribution.

This definition of the median can be applied directly to a distribution of
ungrouped scores. The first step is to arrange the raw scores in order of size.
Then choose that score in the distribution above which and below which there
are an equal number of scores. The following two examples will serve to
illustrate the determination of the median from a few scores already arranged
according to size.


(a)   ...
      ...
      27
      20     The median is 20
      19
      11
      ...

(b)   52
      48
      39
             The median is 36
      33
      27
      20

In the first distribution the median is 20 because there are three scores
above and three scores below. In the second distribution, however, there is
no single middle score which divides the distribution; therefore the median
is determined to be midway between the two central scores of the distribution.
The two central scores in the second distribution are 33 and 39, and the
midway point between these two scores is 36, which is the median of the
distribution. In general, when there is an odd number of measures, then there is a
single measure that divides the distribution in half. For an even number of
measures it is necessary to find the two central scores, and the median is the
midway point between these two scores.
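
As a minimal sketch of the rule just stated, the following Python function arranges the scores in order of size and picks the middle score, or the midway point of the two central scores. The function name and the seven-score list are our own illustrations; only the six-score list is the second example above.

```python
def median_ungrouped(scores):
    """Median of raw scores: middle score, or midway point of the two central ones."""
    ordered = sorted(scores)          # arrange the scores in order of size
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:                    # odd number of scores: one middle score
        return ordered[middle]
    # even number of scores: midway between the two central scores
    return (ordered[middle - 1] + ordered[middle]) / 2


even_example = [52, 48, 39, 33, 27, 20]      # example (b) in the text
odd_example = [31, 29, 27, 20, 19, 11, 8]    # hypothetical seven scores

print(median_ungrouped(even_example))
print(median_ungrouped(odd_example))
```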

If we are concerned with a large number of scores, the method just
outlined for finding the median among ungrouped data would become very
tedious. This will be apparent if you consider the task of arranging 1,000
scores in order of size. Hence, we need to have some method for determining
the median of scores that have been grouped in intervals.





To understand the median of grouped numbers, let us set up the definition
of the median in terms of a histogram. We have already observed that the
area of the histogram represents the total number of scores in the distribution.
Therefore, to be consistent with the previous definition, the median is now
defined as the point on the horizontal axis of the histogram where 50 per cent
of the area lies to the left and 50 per cent lies to the right of the point. This
is illustrated in the following figure.


Figure 26. Median defined in terms of the area of a histogram (50% of the
area lies on each side of the median).

Looking at the histogram in Figure 26, it is apparent that finding the
median of grouped scores is essentially a matter of first determining the middle
interval that contains the median and secondly of having a precise way of
fixing the point within the middle interval. To see just how to obtain a median
for grouped scores, let us go step by step through the process.


       Score Interval    Frequency
           80-84              3    \
           75-79              7     |  26 scores above
           70-74              5     |
           65-69             11    /
           60-64             16       Middle interval
           55-59             10    \  18 scores below
           50-54              8    /
           Total             60

Figure 27. Finding a median illustrated (horizontal axis: score intervals).

With reference to Figure 27, our first problem is to find the middle
interval that contains the median. This is done by first dividing the total number
of scores by 2; in this example, 60 ÷ 2 = 30. There should be 30 scores above
and 30 scores below the median. Now we start with the lowest interval and
begin adding in the frequency column until we come to the interval where the
total frequency exceeds 30, or half the total. Adding the frequencies of the
first two lower intervals, we get 8 + 10 = 18. Adding the frequency of the
third interval to the frequencies of the first two, we get 16 + 18 = 34, which
exceeds 30. Therefore, the third interval from the bottom, 60-64, is the
middle interval and contains the median. We can verify this by starting from
the top and adding down the frequency column. Adding the frequencies of
the first 4 intervals from the top, we obtain 3 + 7 + 5 + 11 = 26. To add
the frequency of the fifth interval to this sum causes the total to exceed 30;
thus again the interval 60-64 is the middle interval, and it has 18 scores
below it and 26 scores above it.

With the middle interval designated, the next procedure is to locate the
median in the interval. We may analyze this step by using Figure 28, which
reproduces the middle interval of the histogram in Figure 27.


Figure 28. Locating the median in the middle interval (the bar for the 16
scores of the middle interval runs from 59.5 to 64.5, with 18 scores below
and 26 scores above).


From Figure 28 you can see that 12 scores need to be added to the 18 scores
below the interval to make the necessary 30. Similarly, 4 scores are necessary
to increase the 26 scores above the interval to the necessary 30. In other
words, the area of the middle vertical bar representing the 16 scores must be
divided by the median so that a portion of the area representing 12 scores lies
to the left and the remaining portion of the area representing 4 scores lies to
the right. This would satisfy the requirement that 50 per cent of the area
of the histogram, or 30 scores, lie on either side of the median.

The stage is set now for the final calculation. First note that the actual
boundaries for the interval 60-64 are 59.5 and 64.5, as indicated in Figure 28,
and the size of the interval is 5 units along the horizontal axis. Since 12 of the
16 scores must lie beyond the lower boundary, 59.5, then the median must be
located 12/16ths of five units, or 3¾ units, above 59.5. This makes the median
equal to 59.5 + 3.75 = 63.25. This may be verified by noting that 4 of the
16 scores, or 4/16ths of 5 units, must lie below the upper boundary, 64.5. Thus
the median is again equal to 64.5 - 1.25 = 63.25.
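
The counting and interpolation just carried out can be sketched in Python. The function below is our own illustration, not a standard routine; it assumes whole-number score intervals listed from lowest to highest, and, like the text, it assumes the scores in the middle interval are spread evenly through it. The frequencies are those of Figure 27.

```python
def grouped_median(intervals):
    """intervals: (low, high, frequency) rows, lowest interval first."""
    half = sum(f for _, _, f in intervals) / 2    # scores needed on each side
    below = 0                                     # scores counted so far
    for low, high, f in intervals:
        if f > 0 and below + f >= half:           # this is the middle interval
            lower_boundary = low - 0.5            # actual boundary, e.g. 59.5
            size = high - low + 1                 # e.g. 5 units for 60-64
            return lower_boundary + (half - below) / f * size
        below += f
    raise ValueError("the distribution has no scores")


figure_27 = [
    (50, 54, 8), (55, 59, 10), (60, 64, 16), (65, 69, 11),
    (70, 74, 5), (75, 79, 7), (80, 84, 3),
]
median_27 = grouped_median(figure_27)
print(median_27)
```

For these frequencies the function reproduces the hand computation: 12 of the 16 scores in the interval 60-64 are needed, so the result is 59.5 + 3.75.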

Table 5 summarizes the previous detailed discussion of the steps in
computing the median of a distribution of scores.





TABLE 5

An Outline for Computing the Median of a Distribution

       Interval      Frequency
        72-74             2
        69-71            13
        66-68            21
        63-65            44       middle interval (lower boundary 62.5)
        60-62            28    \
        57-59            19     |
        54-56             7     |  58 scores lie below
        51-53             3     |
        48-50             1    /
        Total           138

Steps to follow

1. Divide the total number of scores by 2. 138 ÷ 2 = 69. Thus 69 scores
   must lie above and below the median.

2. Locate the middle interval containing the median by counting up or
   down the frequency column until the 69th score is reached. The middle
   interval is 63-65, and 58 scores lie below it.

3. Find the distance from the lower boundary of the middle interval to the
   median. Since 58 scores lie below the middle interval, then 69 minus 58,
   or 11, more scores of the 44 scores of the middle interval are needed to
   reach the median. The size of each interval is 3 units. The distance is
   equal to 11/44ths of three units, or .75 units.

4. Add this distance to the lower boundary of the middle interval, and the
   result is the median of the distribution. The lower boundary of the
   middle interval is 62.5. Thus 62.5 + .75 = 63.25 is the median.


The median found in Table 5 may be verified by noting that 36 scores lie
above the middle interval and therefore 69 - 36, or 33, more scores of the 44
in the middle interval are needed. The distance from the upper boundary of
the interval down to the median is 33/44ths of 3 units, or 2.25 units.
Subtracting this distance from the upper boundary, we get 65.5 - 2.25 equals
63.25, which is the same as we found by starting our count from the bottom.

From the computation in Table 5 we may observe that it is possible to
compute the median directly from the frequency table without referring to
the histogram. As an introduction, however, the use of the histogram gives
us a better visual picture of what is being done in the computation of the
median. It is important to note that in the computation of the median the
scores in the middle interval are assumed to be evenly distributed throughout
the interval. This assumption is certainly necessary to facilitate computation,
and this is the explanation of any discrepancy between the median computed
directly from the raw scores and the median computed from a frequency table
of the same scores. In any case, whether computing from raw scores or
grouped scores, the computation of the median is essentially a process of
counting scores arranged in order until the middle score is reached.

ARITHMETIC MEAN 

Another frequently used representative measure of a group of scores is
the arithmetic mean, often referred to as the mean or the average. In
everyday affairs, we hear and read such phrases as average income, mean annual
rainfall, average number of runs batted in, average yards gained from the line
of scrimmage, and average class size. In each of these instances, the arithmetic
mean is the single measure used to represent a group of measures.

Everyone is probably familiar with the computation of an average or
mean. The mean of a group of scores is defined as the sum of the scores
divided by the number of scores. For example, to find the arithmetic mean
of the following scores: 12, 23, 15, 30, 8, 6, 18.



First, find the total       Then, divide the total       The result is the
sum of the scores.          by the number of scores      arithmetic mean.
                            in the group.
          12
          23
          15
          30                    112 ÷ 7 = 16
           8
           6
          18
  Total  112





This definition is particularly applicable to a set of raw, ungrouped scores.
Since we are interested only in the total sum of the scores, it is not necessary
to arrange the scores in order of size as was necessary for the median.

In setting up the definition of the arithmetic mean symbolically, if we let

    $X$ represent each of the raw, ungrouped scores
    $\sum$, the Greek letter sigma, represent "the sum of" or "summation of"
    $N$ represent the total number of scores in the group
    and $\bar{X}$ (X-bar) represent the arithmetic mean

then, $\bar{X} = \frac{\sum X}{N}$

In the example, $X$ represents each of the raw scores 12, 23, 15, 30, 8, 6,
18; $N$ is 7; $\sum X$ equals 112; and $\bar{X}$ equals 16, the arithmetic mean
of the seven scores.

Next, it is necessary to define what is meant by the "deviation" of a score
from the mean. If we subtract the mean from any score, the result is the
deviation of that score from the mean. Symbolically, if we let

    $X$ represent the raw score
    $\bar{X}$ represent the arithmetic mean
    and $d$ represent the deviation of the score from the mean

then, $d = X - \bar{X}$

If the mean of a distribution is equal to 22 and we wish to find the
deviation of a particular score in the distribution, say 29, then according to the
definition, deviation $d$ = 29 - 22 = +7 units. Similarly, the deviation of the
score 17 is 17 - 22 = -5 units. The plus and minus signs are necessary
to indicate whether the deviations are above or below the mean.

Now let us refer to the 7 scores for which we computed a mean. This
time we shall compute the deviation of each score from the mean
($\bar{X} = 16$).


    Raw Score     Deviation              Deviations      Deviations
       $X$        $d = X - \bar{X}$      Above Mean      Below Mean
       12             - 4                                   - 4
       23             + 7                   + 7
       15             - 1                                   - 1
       30             +14                   +14
        8             - 8                                   - 8
        6             -10                                   -10
       18             + 2                   + 2
                                            ---             ---
                                            +23             -23


If we add the deviations of the scores above the mean, we obtain +23.
Similarly, if we add the deviations of the scores below the mean, we get -23.
And if we add the two sums, +23 + (-23) = 0, we get zero. This is a very
important property of the arithmetic mean; namely, that the sum of the
deviations of scores above the mean is equal in size, though opposite in sign,
to the sum of the deviations of scores below the mean. In other words, the sum
of the deviations of all the scores from the mean is zero. If you wish actually
to observe this property, take a ruled stick and place small blocks of equal
weight at positions on the stick that correspond to the seven scores, 12, 23,
15, 30, 8, 6, and 18, as in Figure 29.

Figure 29. Mean as a point of balance (the stick is supported at the mean, 16).

The stick will balance as indicated if supported at the mean. Thus the
arithmetic mean is representative of a group of scores in the same manner as
the center of gravity is representative of all the parts of a rigid body.
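
The balance property can be checked directly for the seven scores. The following Python sketch (the variable names are our own) computes each deviation from the mean and adds the deviations above and below the mean separately.

```python
scores = [12, 23, 15, 30, 8, 6, 18]

mean = sum(scores) / len(scores)             # sum of scores divided by N
deviations = [x - mean for x in scores]      # d = X - mean for each score

above = sum(d for d in deviations if d > 0)  # deviations above the mean
below = sum(d for d in deviations if d < 0)  # deviations below the mean

print(mean, above, below, above + below)
```

The two partial sums are +23 and -23, so the deviations as a whole add to zero, which is the arithmetic counterpart of the stick balancing at 16.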





The fact that the sum of the deviations of scores from the mean is zero
provides another method for computing the mean. We can guess what the
mean of a distribution is likely to be and then correct our guess for the true
mean. This method can be shown by using again our seven scores.


Steps to follow

1. Start by guessing a mean. Here we guess that 19 is the mean.

2. Find the deviation ($d'$) of each score from 19.

       Raw Scores     Deviation ($d'$) from
          $X$         Guessed Score 19
          12               - 7
          23               + 4
          15               - 4
          30               +11
           8               -11
           6               -13
          18               - 1
                           ---
                           -21

3. Add the deviations and divide by the number of deviations. This is the
   correction.

       Correction = -21 ÷ 7 = -3

4. Add the correction to the guessed score to obtain the true mean.

       True Mean = Guessed Score + Correction = 19 + (-3) = 16

For ungrouped scores this method has no particular advantage over that of
simply adding the scores and dividing by the number of scores. But it is the
conventional method for scores grouped in intervals, and here it does save
time and energy.
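
A short Python sketch of the guess-and-correct method follows. The function name is our own; the point it illustrates is that the correction always brings the guess back to the true mean, whatever the guess.

```python
def mean_from_guess(scores, guess):
    """Correct a guessed mean by the average deviation from the guess."""
    deviations = [x - guess for x in scores]        # d' for each score
    correction = sum(deviations) / len(scores)      # sum of d' divided by N
    return guess + correction                       # true mean


scores = [12, 23, 15, 30, 8, 6, 18]
print(mean_from_guess(scores, 19))   # the guess used in the text
print(mean_from_guess(scores, 10))   # a poorer guess gives the same mean
```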

To compute a mean from a large number of ungrouped scores, many of
which are duplicated, we can use a computing machine and apply the first
formula, $\bar{X} = \frac{\sum X}{N}$. Or, if none is available, we may use a
slightly different formula and proceed as follows.


       Score     Frequency
        $X$         $f$         $fX$
        12            9          108
        11           17          187
        10           42          420
         9           24          216
         8           19          152
         7           11           77
         6          ...          ...
                 $N$ = ...   $\sum fX$ = ...




As indicated above, multiply each score by its frequency and then find the
total of these products ($\sum fX$). Divide this sum by the total number of
scores ($N$). The result is the mean of the scores. For scores in a frequency
table, the original definition $\bar{X} = \frac{\sum X}{N}$ becomes
$\bar{X} = \frac{\sum fX}{N}$, where $f$ is the frequency of each of the scores.
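
The frequency-table form of the mean can be sketched in Python as follows. The small table is hypothetical, chosen only so that the result is easy to check against the plain sum-and-divide definition; the function name is our own.

```python
def mean_from_frequencies(table):
    """table: (score, frequency) rows. Returns sum of fX divided by N."""
    n = sum(f for _, f in table)               # N, the total number of scores
    total = sum(x * f for x, f in table)       # sum of the products fX
    return total / n


# Hypothetical table: score 12 occurs twice, 10 five times, 8 three times.
table = [(12, 2), (10, 5), (8, 3)]
raw_scores = [12, 12, 10, 10, 10, 10, 10, 8, 8, 8]

print(mean_from_frequencies(table))
print(sum(raw_scores) / len(raw_scores))
```

Both lines give the same mean, since multiplying a score by its frequency is only a shorter way of adding that score as many times as it occurs.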

We are now ready to learn the computation of the mean for scores grouped
in intervals. In obtaining a mean from grouped scores, we combine the process
of correcting a guess and of multiplying a score by its frequency. We need to
assume that the mid-point of each interval represents the scores in the interval.

Using the same scores for which we obtained a median (Table 5), we
say that our guessed mean ($M_g$) is the mid-point of the interval 60-62, or 61.
Our table and calculations should look like this:


                 Mid-point     Deviation            Frequency
    Interval        $X$        $d' = (X - M_g)$        $f$         $fd'$
     72-74          73             +12                   2          + 24
     69-71          70             + 9                  13          +117
     66-68          67             + 6                  21          +126
     63-65          64             + 3                  44          +132
     60-62          61               0                  28             0
     57-59          58             - 3                  19          - 57
     54-56          55             - 6                   7          - 42
     51-53          52             - 9                   3          - 27
     48-50          49             -12                   1          - 12
                                                  $N$ = 138   $\sum fd'$ = +261

    Correction = $\frac{\sum fd'}{N}$ = +261 ÷ 138 = +1.9

    True Mean = Guessed Mean + Correction = 61 + 1.9 = 62.9


In the deviation column ($d'$) of the above table, we note that each
deviation may be divided by 3, which is the size of each of the intervals. If this
is done, then the ($d'$) column, which formerly looked like this: +12, +9, +6,
+3, 0, -3, -6, -9, -12, now becomes this: +4, +3, +2, +1, 0, -1,
-2, -3, -4. This way of expressing deviations, in effect by counting
intervals, is conventional and simplifies our calculations. We shall symbolize
such deviations by the small letter $u$. When we use $u$, we must later multiply
$\sum fu$ by the size of the interval so as to have the proper correction.

The usual computation of the mean is made from grouped scores, uses
the guessed mean and correction method, and counts interval deviations rather
than score deviations. This computation, the one you should use, is described
in Table 6. Note that the "assumed mean" is used here rather than
"guessed mean." The two mean the same thing and may be used
interchangeably.
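
As a rough check on this procedure, here is a Python sketch of the grouped computation using interval deviations. The function and its arguments are our own naming; it assumes intervals of equal size listed from lowest to highest, and it takes the frequencies from the distribution of Table 5.

```python
def grouped_mean(intervals, assumed_index):
    """intervals: (low, high, frequency) rows, lowest first, all of equal size.

    assumed_index picks the interval whose mid-point is the assumed mean.
    """
    low, high, _ = intervals[assumed_index]
    assumed_mean = (low + high) / 2             # mid-point, e.g. 61 for 60-62
    size = high - low + 1                       # size of interval, e.g. 3
    n = sum(f for _, _, f in intervals)
    # u counts intervals above (+) or below (-) the selected interval
    sum_fu = sum(f * (k - assumed_index)
                 for k, (_, _, f) in enumerate(intervals))
    correction = size * sum_fu / n
    return assumed_mean + correction


distribution = [
    (48, 50, 1), (51, 53, 3), (54, 56, 7), (57, 59, 19), (60, 62, 28),
    (63, 65, 44), (66, 68, 21), (69, 71, 13), (72, 74, 2),
]
mean_60_62 = grouped_mean(distribution, 4)      # assumed mean 61
mean_48_50 = grouped_mean(distribution, 0)      # a different assumed mean
print(round(mean_60_62, 1), round(mean_48_50, 1))
```

Choosing a different interval for the assumed mean changes the size of the correction but not the final mean, which rounds to 62.9 in either case.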


TABLE 6

An Outline for Computing the Mean from
the Formula $\bar{X} = A + i\frac{\sum fu}{N}$

       Interval       $f$        $u$        $fu$
        72-74           2         +4         + 8
        69-71          13         +3         +39
        66-68          21         +2         +42
        63-65          44         +1         +44
        60-62          28          0           0
        57-59          19         -1         -19
        54-56           7         -2         -14
        51-53           3         -3         - 9
        48-50           1         -4         - 4
                  $N$ = 138            $\sum fu$ = +87

Steps to follow

1. Select an interval whose mid-point will be the assumed mean. Here we
   select the interval 60-62, and its mid-point, 61, is the assumed mean ($A$).

2. Set up the $u$ column in terms of the number of intervals above and below
   the selected interval 60-62.

3. Multiply each $f$ by $u$ and then find the sum of the column ($\sum fu$).

4. Multiply this sum ($\sum fu$) by the size of the intervals ($i = 3$) and then
   divide by $N$. This is the correction.

       Correction = $i\frac{\sum fu}{N}$ = (3 × 87) ÷ 138 = 261 ÷ 138 = +1.9

5. Add this correction to the assumed mean, and the result is the true mean.

       $\bar{X} = A + i\frac{\sum fu}{N}$ = 61 + 1.9 = 62.9


The histogram may be used to illustrate the position of the mean as it
was to illustrate the median. You will recall that the median is the point on
the histogram that divides the area in half. Now, if we visualize the histogram
as being cut out of cardboard or some other stiff material having weight,
then the mean is the point on the base on which the histogram balances itself.
This is illustrated in Figure 30, in which the histogram is drawn from the
frequency distribution in Table 6.






Figure 30. Illustrating the mean as the center
of balance of the histogram.


COMPARISON OF THE MEDIAN AND MEAN

Although the median and the mean are both representative scores of a
group of scores, we can see now that the definitions and the processes of
finding them are quite different. The median is sought by a process of counting
ranked scores until the middle score is reached. The mean is found through
a process of determining the balance point of the scores. These differences
have implications in the application of the mean and median, which we shall
now discuss.

In the first place, if the distribution of scores is symmetrical or nearly
symmetrical, then there is little or no difference between the middle point and
the balance point of the distribution. Consequently, the mean and median would
have about the same numerical value, and in this sense there would be no
choice between the two measures.

For skewed distributions, on the other hand, the numerical values of the
mean and median will ordinarily be quite different, and we shall discuss such
cases with the aid of Figure 31.



Figure 31. The effect of skewed distributions on the mean and median.




As illustrated, the mean is pulled toward the end of the distribution in
which there are a few extreme scores, while the median continues to divide
the area in half. The choice of whether to use the mean or the median for
skewed distributions will depend upon the purpose of the user. If it is desired
to avoid any undue influence of extreme scores, then the median should be
preferred. Often both the mean and median are reported to indicate whether
or not the distribution is skewed. For example, suppose it is reported that the
mean is equal to 47 and the median is equal to 55 for a particular distribution.
Then we would suspect that there are a few extremely small scores in the
distribution and that the graph of the distribution would be similar to Figure
31-B.
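
The pull of a few extreme scores on the mean can be seen in a small Python sketch. Both lists of scores are hypothetical, made up for this illustration; the `mean` and `median` functions come from Python's standard `statistics` module.

```python
from statistics import mean, median

# Hypothetical symmetrical distribution: mean and median coincide.
symmetrical = [40, 45, 50, 55, 60]

# Hypothetical distribution with two extremely small scores.
low_tail = [5, 10, 50, 55, 58, 60, 62]

print(mean(symmetrical), median(symmetrical))
print(round(mean(low_tail), 1), median(low_tail))
```

In the second list the two very small scores pull the mean well below the median, which stays at the middle score, 55. A reported mean noticeably below the median is the same signal described above.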

A further distinction between the median and the mean must be made 
with respect to the types of scores they may represent. You may recall that 
in the first chapter we presented some measurement symbols that expressed 
scale position and others that expressed rank position. In the calculation of 
the mean it is necessary cither to add the scores or to determine deviations of 
the scores, and both of these processes require unit-scale numbers. As a result, 
the mean should be used only for distributions of unit-scale scores and would 
be meaningless if used with scores expressing rank or order. The median, on 
the other hand, may be used with either type of measurement symbol and there- 
fore it is the most widely applicable descriptive measure of central tendency. 
From the standpoint of theoretical statistics, the mean has certain advantages
since it figures in the computation of other statistical measures such as the
standard deviation, the correlation coefficient, and the analysis of variance.

Describing the Variability of a Group

It should take only a moment's reflection to realize that to describe a group
of measures by a single representative measure would be incomplete. There
certainly would need to be some indication of the spread, or dispersion, or
variability of the scores in the group. To know, for example, that the average
annual temperature in two American cities is the same and equal to 70°
would still leave one to wonder about the climate of the two cities. In one city
the temperature could range from extremely cold to extremely hot, and still
have an average temperature of 70°. In the other city, which has the same
average temperature, the variation could be small and hence would have a
more favorable climate as far as temperature is concerned. Of many possible
ways to describe the variability of a group, we shall discuss the three indexes
most commonly used: the range, interpercentile ranges, and the mean and
standard deviations.

RANGE 

Of these, the simplest and crudest method is the range. The range of a 
distribution of scores is defined as the difference between the highest score 
and the lowest score in the distribution. To see how the range is computed,
look at the raw scores presented in Table 3, where we find that the highest
score is 133 and the lowest 91. Therefore, by definition the range is equal to
42 (133 - 91). The range, then, presents us with an approximate basis for
comparing the variability of two distributions, as in the case of the two cities
discussed above. If the range of temperatures in one city is 93 and the range
in the other city is 47, this indicates that the variation in the latter city is
less. The range is only a rough indication of variability because it is
conceivable that two distributions may have the same or nearly the same range
and yet have different variabilities, as in Figure 32.



Figure 32. Two distributions having the same
range but different variability.

In distribution A, the scores are scattered over the entire range while in 
distribution B most of the scores tend to cluster more closely around the 
center. Yet the highest and lowest scores are the same, making the ranges 
equal. 

INTERPERCENTILE RANGES

From the previous discussion, we can see that the range is a function of
the scores at either end of the distributions. In order to exclude these few
extreme scores that determine the total range, a special type of range called the
interpercentile range has been developed as a measure of variability.
However, before we can define an interpercentile range, we need to know the
meaning of a percentile.

Since percentiles will be discussed in complete detail in a later section
dealing with measures of relative position, we shall develop here only what
is necessary for an understanding of an interpercentile range. Briefly, a
percentile is a score or point on a scale of scores below which a certain
percentage of the total number of scores lie. If exactly 35 per cent of the total
number of scores are less in value than the score of 52, then the score 52 is
called the 35th percentile and it is designated as $P_{35}$. Likewise, if Johnny
weighs 127 pounds and he is heavier than 80 per cent of his class at school,
then his weight, 127 pounds, is the 80th percentile for that particular group,
and is designated as $P_{80}$. In general, if N per cent of the total number of
scores or N per cent of the area of a frequency curve lies below a certain
value, then that value is called the Nth percentile and is designated as $P_N$.
The 58th percentile, for instance, could be schematically presented as follows:






Certain percentiles are called by special names. The 5()th percentile, 
is called the median, which we have previously discussed. The 25ih percentile 
is called the lower quartile and is designated as Qi. Similarly the 75th per- 
centile is called the “upper quartile’’ and is designated as C?a- The 10th, 20th, 
30th percentiles are called 1 st, 2d, 3rd, . . . deciles respectively. 

The interpercentile range may now be defined as the range or difference
between any two symmetrically placed percentiles above and below the
median. The range from the 10th to the 90th percentiles, $P_{90} - P_{10}$, or
perhaps the range from the 7th to the 93rd percentiles, $P_{93} - P_{7}$, would
serve to indicate the spread of the scores and yet cut off the few scores at
either end of the distribution which may be erratic. The range from the 25th
percentile to the 75th percentile, or in other words, from the lower quartile
to the upper quartile, is called the interquartile range, $Q_3 - Q_1$, and if
the interquartile range is divided by 2, $\frac{Q_3 - Q_1}{2}$, it is called the
quartile deviation. In any case, an interpercentile range, interquartile
range, or quartile deviation as a measure of variability is used in connection
with the median as a representative score.
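The definitions above can be tried out on a handful of scores. The following is only a minimal sketch in Python: the function name `percentile` and the seven scores are our own illustrative choices, and the straight-line interpolation between ordered scores is one of several conventions in use for locating a percentile in a small group, not necessarily the one a given test manual follows.

```python
def percentile(scores, p):
    """Return the p-th percentile by linear interpolation between ordered scores.

    Assumption: the p-th percentile sits at fractional position p/100 * (n - 1)
    in the ordered list; other conventions give slightly different values.
    """
    ordered = sorted(scores)
    position = (p / 100) * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


scores = [23, 12, 6, 30, 18, 8, 15]       # illustrative scores only

q1 = percentile(scores, 25)               # lower quartile
q3 = percentile(scores, 75)               # upper quartile
interquartile_range = q3 - q1
quartile_deviation = interquartile_range / 2

# An interpercentile range that trims the extreme scores at both ends.
range_10_to_90 = percentile(scores, 90) - percentile(scores, 10)

print(q1, q3, interquartile_range, quartile_deviation, range_10_to_90)
```

Under this convention the 50th percentile of the same scores is the median, so the interquartile range and the median can be reported together as the text recommends.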


MEASURES OF VARIABILITY BASED ON DEVIATIONS

The third method of indicating the dispersion of scores is based upon the
deviations that the scores have from the mean. Since these deviations are
indications of distances of the scores from the mean, an average of these
deviations would seem a natural way of indicating the spread of these scores.

We have already observed that the algebraic sum (taking into account the
plus and minus signs) of the deviations from the mean is always zero. So the
matter of finding the average of the deviations is not without difficulties. Some
method must be found of preventing the positive and negative deviations from
canceling out and still keeping the relative sizes of the deviations intact. Two
methods are possible: to disregard the signs, or to square the deviations. The
first method results in the mean deviation and the second method in the
standard deviation.

Mean Deviation. First, consider the method of disregarding the signs.
Whenever the sign of a number is disregarded, the result is the "absolute
value" of the number. Thus the absolute value of -3 is 3 and likewise the
absolute value of +3 is 3. The absolute value is symbolized by two vertical
bars on either side of the number: | N | means the absolute value of N. Now,
the mean deviation is equal to the sum of the absolute values of the deviations
of the scores divided by the number of scores. This measure is not seen very
often, but does provide a direct approach to indicating variability based on
the deviations of scores.
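A short computation may make the two statements above concrete: the signed deviations cancel, while their absolute values do not. This is only a sketch in Python; the seven scores and the variable names are illustrative choices of ours.

```python
scores = [23, 12, 6, 30, 18, 8, 15]       # illustrative scores only

mean = sum(scores) / len(scores)
deviations = [x - mean for x in scores]

# The algebraic sum keeps the plus and minus signs, so the deviations cancel.
algebraic_sum = sum(deviations)

# The mean deviation disregards the signs by taking absolute values.
mean_deviation = sum(abs(d) for d in deviations) / len(scores)

print(mean, algebraic_sum, round(mean_deviation, 2))
```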

Standard Deviation. The second method, squaring the deviations in
order to get rid of the signs, provides the basis for computing the standard
deviation. If the deviations of the scores are squared, the mean of these
squared deviations is called the variance and the square root of the variance
is known as the standard deviation ($s$). The following example outlines the
computation of the standard deviation for 7 scores whose mean is 16.


Steps to follow:

1. Determine the deviation ($d$) of each score from the mean.

2. Square each deviation ($d^2$) and add the column, $\sum d^2$.

3. Divide the sum of the deviation squared column by the number of scores, $N$; the result is the variance.

4. Take the square root of the variance and the result is the standard deviation, $s$.

    Raw Scores     Deviation
        X          d = X - mean        d squared

       23             +7                  49
       12             -4                  16
        6            -10                 100
       30            +14                 196
       18             +2                   4
        8             -8                  64
       15             -1                   1
                     ----                ----
                       0                 430

$\bar{X} = 16$, $N = 7$

Variance $= \frac{\sum d^2}{N} = \frac{430}{7} = 61.4$

Standard deviation $s = \sqrt{\frac{\sum d^2}{N}} = \sqrt{61.4} = 7.8$
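The four steps of the outline can be followed line for line in a short program. The sketch below is in Python with variable names of our own choosing. It divides by $N$, as the outline does; readers checking the result against other software should be aware that some routines divide by $N - 1$ instead and so give a slightly larger figure.

```python
import math

scores = [23, 12, 6, 30, 18, 8, 15]
n = len(scores)
mean = sum(scores) / n

# Step 1: deviation of each score from the mean.
deviations = [x - mean for x in scores]

# Step 2: square each deviation and add the column.
sum_of_squares = sum(d ** 2 for d in deviations)

# Step 3: divide by the number of scores to obtain the variance.
variance = sum_of_squares / n

# Step 4: the square root of the variance is the standard deviation.
standard_deviation = math.sqrt(variance)

print(sum_of_squares, round(variance, 1), round(standard_deviation, 1))
```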



In engineering, the standard deviation is called the root mean square
(rms), which is a more descriptive term since the standard deviation is the
square root of the mean of the squared deviations. In the example above, if
the raw scores had a frequency ($f$) other than 1, the formula for the standard
deviation would be $s = \sqrt{\frac{\sum f d^2}{N}}$.



With this basic definition of the standard deviation, we can now turn to
its computation from a frequency table. Here the procedure is essentially the
same as that for the mean. First, we compute the standard deviation of the
scores from the assumed mean, $\sqrt{\frac{\sum f d^2}{N}}$, and then subtract
the correction for the true mean, $\left(\frac{\sum f d}{N}\right)^2$. When this
is carried out, the result then becomes

$$s = \sqrt{\frac{\sum f d^2}{N} - \left(\frac{\sum f d}{N}\right)^2}$$

where $d$ is here the deviation of a score from the assumed mean. In case the
deviations are in terms of interval units ($u$), then the formula becomes

$$s = i\sqrt{\frac{\sum f u^2}{N} - \left(\frac{\sum f u}{N}\right)^2}$$

where $i$ is the size of the interval. Table 7 outlines the computation of the
standard deviation for scores grouped in intervals.


TABLE 7

An Outline for the Computation of the Standard Deviation

Steps to follow:

1. Assume a mean which is the midpoint of an interval and set up the $u$ column.

2. Multiply each $f$ by its $u$ and add the column, $\sum fu$.

3. Multiply each $fu$ by its $u$ and add the column, $\sum fu^2$.

4. Substitute the values in the formula for standard deviation: $\sum fu = -32$, $\sum fu^2 = 188$, $i = 5$, $N = 60$.

    Interval      f        u        fu       fu squared

    80-84         3        3         9          27
    75-79         7        2        14          28
    70-74         5        1         5           5
    65-69        11        0         0           0
    60-64        16       -1       -16          16
    55-59        10       -2       -20          40
    50-54         8       -3       -24          72
                ----              ----         ----
              N = 60               -32          188

$$s = i\sqrt{\frac{\sum f u^2}{N} - \left(\frac{\sum f u}{N}\right)^2} = 5\sqrt{\frac{188}{60} - \left(\frac{-32}{60}\right)^2} \approx 8.4$$
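The outline of Table 7 can be checked by machine. The sketch below, in Python, is an illustration under stated assumptions rather than a prescribed routine: the interval midpoints (82, 77, and so on) assume that the score limits shown are whole numbers, the assumed mean is taken as 67, the midpoint of the interval 65-69, and all variable names are ours.

```python
import math

# (interval midpoint, frequency) pairs read from Table 7.
table = [(82, 3), (77, 7), (72, 5), (67, 11), (62, 16), (57, 10), (52, 8)]
interval_size = 5
assumed_mean = 67                          # midpoint of the interval 65-69

n = sum(f for _, f in table)

# Step 1: deviations from the assumed mean, expressed in interval units.
u_values = [(midpoint - assumed_mean) // interval_size for midpoint, _ in table]

# Steps 2 and 3: the fu and fu-squared columns and their sums.
sum_fu = sum(f * u for (_, f), u in zip(table, u_values))
sum_fu2 = sum(f * u * u for (_, f), u in zip(table, u_values))

# Step 4: substitute in the formula for the standard deviation.
s = interval_size * math.sqrt(sum_fu2 / n - (sum_fu / n) ** 2)

print(n, sum_fu, sum_fu2, round(s, 2))
```

Because the correction term is subtracted, the result does not depend on which interval midpoint is chosen as the assumed mean; only the amount of arithmetic changes.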




COMPARISON OF STANDARD DEVIATION AND INTERPERCENTILE RANGE

Although the interpercentile range and the standard deviation both give
an indication of variability, we can see from the previous discussion that they
are different in nature. The interpercentile range represents an actual range
or distance from one point in the distribution to another point, and the
percentage of scores between these two points can be stated. For example,
the interquartile range, which is a special case of the interpercentile range,
establishes two scores, the upper quartile and lower quartile, that include the
middle 50 per cent of the distribution. The standard deviation, on the other
hand, is simply a representative deviation of all the deviations of the scores,
and, except in such theoretical distributions as the normal distribution, the
standard deviation does not represent a range that includes a certain
percentage of the scores. Thus the interpercentile range is a better descriptive
measure than the standard deviation because it is more easily visualized and
comprehended.

It has already been noted that the interpercentile range should be used
in connection with the median and that the standard deviation should be used
with the mean. Consequently, the interpercentile range is particularly
adaptable to measurement symbols that indicate rank position, and the standard
deviation is limited to measurement symbols that indicate scale position. Like
the mean, the standard deviation is useful in the computation of further
statistical measures and in the area of sampling statistics. Therefore, from the
standpoint of statistical analysis the standard deviation is preferred over the
interpercentile range.

The interpercentile range and the standard deviation are equally difficult
to interpret. For example, suppose we calculated that the interquartile range
of a distribution is 17 and that the standard deviation is 9. What do these
measures mean? Standing alone, these measures mean very little. They can be
interpreted only by comparison with other results in similar situations. If it
is found, say, that the interquartile ranges for other similar distributions are
less than 17, then we would suspect that our distribution is somewhat dispersed.
In general, any judgment regarding the degree or extent of variability of a
distribution is a relative matter.

Describing the Relative Position of Individuals in Groups 

We turn now from describing groups to describing the individuals within
the groups and in particular the relative position of the individuals. As was
indicated in previous sections, any score or measure standing by itself has
little meaning. Just what does Mary's test score of 55 signify if that is all we
know? In this section we shall discuss certain statistical procedures able to
convert individual raw test scores to measures that indicate more meaningful
relative position.

RANK ORDER

As usual, we shall start with the simplest technique for indicating relative
position, that of establishing rank order. This method consists simply of
arranging the raw scores in order of size and then of assigning a rank to each,
the top score receiving Rank 1, and so on down the line. If a score is obtained
by two or more persons, then the rank of the score is repeated the necessary
times and the next score in line would have a rank which takes into account
the previous repetition. For example, if three persons received a score that
ranks third, then the ranks would proceed as follows: 1, 2, 3, 3, 3, 6, 7, etc.
Note that the third rank is repeated the necessary three times, followed by
rank 6, which indicates that the next score is 6th in line. Obviously, if the
number of scores is large, the process of ranking will become unwieldy.
Moreover, to know just the rank of a score is not sufficient because the same
rank would have different meaning in groups of different sizes. For example,
consider a rank of 15 in a group of 432 scores and then consider a rank of
15 in a group of 20 scores.
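The rule for repeated scores described above can be stated compactly: the rank of a score is one more than the number of scores that exceed it. A minimal sketch in Python follows; the function name `rank_order` and the seven scores are hypothetical, chosen so that three persons tie for third place.

```python
def rank_order(scores):
    """Give each score a rank, the top score receiving Rank 1.

    Tied scores repeat the same rank, and the next score takes the rank
    it would have had if there had been no tie.
    """
    ordered = sorted(scores, reverse=True)
    # The first position of a score in the ordered list fixes its rank.
    return [ordered.index(score) + 1 for score in scores]


scores = [98, 95, 90, 90, 90, 85, 80]     # hypothetical scores
ranks = rank_order(scores)
print(ranks)
```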





PERCENTILE RANK

To correct the limitations of simple rank order, another measure of
relative position has been developed, the percentile rank. Percentiles have been
discussed previously in Chapter 1 in connection with measurement symbols
that indicate order position, and earlier in this chapter in connection with
interpercentile ranges. Here we shall discuss the percentile in connection with
determining the relative standing of an individual raw score in a group of
scores.

TABLE 8

An Outline for Developing a Cumulative Percentage Curve

Steps to follow:

1. Start at the bottom of the frequency column and begin adding the frequencies one at a time, recording the result of each addition in the cumulative frequency column.

2. Divide each number in the cumulative frequency column by N = 90 and record the result in the cumulative percentage column. Thus each number in the cumulative percentage column represents the percentile rank of a boundary point.

3. Set up a table of boundary points and percentile ranks and plot from this table the cumulative percentage curve.

    Interval     Frequency     Cumulative     Cumulative
                               frequency      percentage

    130-134          1             90            100
    125-129          3             89             99
    120-124          6             86             96
    115-119         10             80             89
    110-114         14             70             78
    105-109         21             56             62   (this is the percentile rank of 109.5)
    100-104         19             35             39
     95-99          11             16             18
     90-94           5              5              6
                 ------
                 N = 90

    Boundary point     Percentile rank

         89.5                 0
         94.5                 6
         99.5                18
        104.5                39
        109.5                62
        114.5                78
        119.5                89
        124.5                96
        129.5                99
        134.5               100

[Figure: the cumulative percentage curve plotted from the boundary points above; horizontal axis, raw score; vertical axis, cumulative percentage from 0 to 100.]

The procedure for finding the percentile rank of a raw score in a small
group of scores is simply that of finding the percentage of the total number
of scores that fall below the given raw score. Suppose, for example, out of a
total of 34 scores, 14 scores fall below the raw score 67, or approximately
41 per cent of the scores fall below 67. Then the percentile rank of the raw
score 67 for this particular group is 41, which is designated as $P_{41}$.
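For a small group the rule amounts to a single count and a single division. The sketch below is in Python; the function name `percentile_rank` is ours, and the 34 scores are invented solely so that 14 of them fall below 67, as in the example.

```python
def percentile_rank(scores, raw_score):
    """Percentage of the scores in the group that fall below raw_score."""
    below = sum(1 for s in scores if s < raw_score)
    return 100 * below / len(scores)


# Hypothetical group: 14 scores below 67, the score 67 itself, and 19 above.
scores = [50 + k for k in range(14)] + [67] + [70 + k for k in range(19)]

rank = round(percentile_rank(scores, 67))
print(len(scores), rank)
```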

Finding the percentile ranks of raw scores in large distributions grouped
in intervals is a more involved process. In this case we can employ a bit of
strategy to save ourselves prolonged computations. If you will look closely at
a frequency table, you may notice that the computation of the percentile rank
of a boundary point between two intervals is a very simple and direct
procedure. For any boundary point, the percentile rank is found by adding the
frequencies of the intervals below the boundary point and dividing the sum
by the total frequency. Now, after finding the percentile rank of each boundary
point, we have enough data to plot a graph from which the percentile rank
of any score in the distribution may be determined. Of course this graph has
a special name. It is called a cumulative percentage curve or an ogive. Table 8
presents an outline of the steps leading to the plotting of a cumulative
percentage curve from which percentile ranks of individual raw scores may be
found.

From the ogive not only can the percentile rank of each raw score be
found but also the median, the quartiles, and the deciles, and these figures
provide the basis for computing the interquartile range or any interpercentile
range. If the curve is carefully plotted on graph paper, the readings can be
sufficiently accurate for most purposes.
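Reading a percentile rank from the ogive can be imitated numerically by joining the boundary points with straight lines. The sketch below, in Python, uses the frequencies of Table 8; the straight-line reading between boundary points is an assumption of ours standing in for a reading taken by eye from graph paper, and the names are illustrative.

```python
# Frequencies of Table 8, from the lowest interval (90-94) to the highest (130-134).
frequencies = [5, 11, 19, 21, 14, 10, 6, 3, 1]
lowest_boundary = 89.5
interval_size = 5
n = sum(frequencies)

# Steps 1 and 2: cumulative frequencies turned into cumulative percentages,
# each paired with the boundary point at the top of its interval.
boundaries = [lowest_boundary]
cumulative_pct = [0.0]
running_total = 0
for k, f in enumerate(frequencies, start=1):
    running_total += f
    boundaries.append(lowest_boundary + k * interval_size)
    cumulative_pct.append(100 * running_total / n)


def percentile_rank(raw_score):
    """Read the ogive by straight-line interpolation between boundary points."""
    if raw_score <= boundaries[0]:
        return 0.0
    for k in range(1, len(boundaries)):
        if raw_score <= boundaries[k]:
            share = (raw_score - boundaries[k - 1]) / interval_size
            return cumulative_pct[k - 1] + share * (cumulative_pct[k] - cumulative_pct[k - 1])
    return 100.0


print([round(p) for p in cumulative_pct])
print(round(percentile_rank(109.5)), round(percentile_rank(112)))
```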


STANDARD SCORE 


The third and final measure of relative position to be discussed here is
the standard score. The standard score of a certain raw score is equal to the
deviation of the raw score from the mean divided by the standard deviation.
Ordinarily the letter $z$ is used to refer to the standard score and thus we often
see the standard score referred to as the $z$ score. The formula for the standard
score is as follows:

$$\text{standard score } (z) = \frac{X - \bar{X}}{s}$$

where $X - \bar{X}$ is the deviation of the score from the mean and $s$ is the
standard deviation.

Suppose that we wish to find the standard score of the raw scores 52 and
40 where the mean and standard deviation of the distribution containing them
are $\bar{X} = 43$ and $s = 6$;

then $z = \frac{52 - 43}{6} = \frac{9}{6} = +1.5$ for the raw score of 52

and $z = \frac{40 - 43}{6} = \frac{-3}{6} = -.5$ for the raw score of 40.
Thus the standard score expresses the relative position of the raw score
by indicating the number of units of standard deviation between the score and
the mean. A standard score of $z = +1.5$ would indicate that the score is one
and a half standard deviations above the mean. A standard score of $z = -.5$
would show that the score is half a standard deviation below the mean.
Standard scores larger than $z = +3$ or less than $z = -3$ are not likely to occur
unless the distribution is badly skewed.
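The conversion of a raw score to a standard score is a single line of arithmetic. A minimal sketch in Python, with a function name of our own choosing, reproduces the two worked values.

```python
def standard_score(raw_score, mean, standard_deviation):
    """Deviation of the raw score from the mean, in standard deviation units."""
    return (raw_score - mean) / standard_deviation


z_for_52 = standard_score(52, mean=43, standard_deviation=6)
z_for_40 = standard_score(40, mean=43, standard_deviation=6)
print(z_for_52, z_for_40)
```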

COMPARISON OF PERCENTILE RANKS AND STANDARD SCORES

Both the percentile ranks and standard scores are widely used as measures
of relative position, with each having its own particular advantages and
disadvantages. The percentile rank is readily understood by most of us at first
encounter, but percentiles are not evenly spaced throughout the distribution
and they are of no use for further statistical computations. Standard scores
are evenly spaced throughout the distribution and they are used in further
statistical work. They have the disadvantage of not being easily understood.
Moreover, their use is limited to unit-scale raw data and it is difficult to
interpret them for markedly skewed distributions. So, once again, the selection
of the type of measure to be used depends upon the purpose of the user and
the nature of data at hand.

NORMS 

Up to this point, nothing specific has been said about the groups of
students upon which all these statistical manipulations have been made. It
has been heretofore tacitly assumed that the raw scores came from any
arbitrary group that the teacher may encounter. This is not always the case,
however. Attempts are often made to obtain raw scores from carefully selected
groups that are to represent a larger population. Using various sampling
techniques, for example, a group of 200 students may be selected, which is
representative of all the fifth-grade students in a particular region. The arithmetic
means, medians, percentile ranks, and standard scores computed from raw
data obtained from such representative groups are called norms.

The interpretative value of having norms available is evident and has al- 
ready been indicated in chapters 1, 3, and 6. Without norms, a teacher can 
compare the performance of a student only with the other members of his 
local class. With norms, however, a teacher has a much broader basis for 
comparison. 

There are two basic methods of relating an individual score to a norm.

1. A sequence of norms may be established representing a gradation of
level, and the individual score is matched with the closest norm to determine
the level of performance or status indicated by the score.

2. The relative standing of the individual score is determined directly with
respect to the representative group.

The first method has led to the development of age and grade norms, and
the second method is the basis for establishing percentile rank and standard
score norms.

Age and Grade Norms. These norms are used whenever the variation of 
a dimension is closely related to age or grade in school. Such dimensions 
would be achievement in various subjects, height, intelligence, reading com- 
prehension, and vocabulary. 

Age norms are often used for height. The mean heights of five-year-old
boys, six-year-old boys, and so on, are computed, thus establishing a sequence
of height norms for boys by age groups. The height of a boy then can be
compared with these norms. For instance, it may happen that a five-year-old boy
has a height typical of six-year-old boys. Mental age is another example of age
norms. An intelligence test is administered to various age groups and the mean
or median is computed for each age level. These mean or median scores on
the intelligence test become age norms or mental age scores. Consequently,
the score of an individual pupil may be used as a basis for determining his
mental age.

Scores on various achievement tests and reading tests are often converted
to norms based upon grade level in school, in the same fashion as for age
norms. Means or medians are computed for each grade-level group of
students and thus are called grade norms. The score of an individual pupil on one
of these tests is then compared with the grade norms to determine the grade
level of his performance.
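Matching a score with the closest norm, as described for age and grade norms, can be sketched as a simple lookup. The example below is in Python. The grade norms in it are invented values used purely for illustration and do not come from any actual test; the function name `closest_grade` is likewise ours.

```python
# Hypothetical grade norms: the mean raw score earned by each grade on some test.
# These figures are invented for illustration only.
grade_norms = {3: 21.0, 4: 28.0, 5: 34.0, 6: 39.0, 7: 43.0}


def closest_grade(raw_score, norms):
    """Return the grade whose norm lies nearest to the pupil's raw score."""
    return min(norms, key=lambda grade: abs(norms[grade] - raw_score))


grade_level = closest_grade(38, grade_norms)
print(grade_level)
```

A result of this kind says only that the pupil's score is typical of that grade on this test. It describes a level of performance and carries no recommendation about placement.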

Percentile Rank and Standard Score Norms. These norms are obtained
directly from a single representative group. A test is administered to this
representative group, and the percentile ranks or standard scores computed
for this group become the percentile norms or standard score norms as the
case may be. For this method, the individual score of a student on a test is
converted directly to relative standing in a group typical of the total
population of which the student is a member. As an example, the score of a
fifth-grade pupil on a reading test is converted to a percentile rank or standard
score in terms of the total population of fifth-grade pupils. This procedure is
different from the first procedure, where the pupil's score was used as a basis
for determining the group of which the pupil would be typical.

Tests developed on the basis of data obtained from groups of students
that are typical of much larger groups of students are commonly called
standardized tests. (See Chapter 6, pages 121-126.) Hence, by the very nature of
their construction, all standardized tests should provide norms, either of the
age and grade type or of the percentile and standard score type. Ordinarily
these norms are found in tables contained in the test manuals, where there is
provided a column of raw scores and alongside is a column of their norm
equivalents. If a test is appropriate to more than one group, we should expect
to find a separate set of norms for each group.

Precautions in the Use of Norms. In the interpretation and use of norms, 
certain precautions should be kept constantly in mind. 

1. Norms do not represent what is ideal. Norms are useful only in
describing the level of performance of a student. In other words, a norm is a
measurement symbol and not an evaluative symbol (see Chapter 9, pages 190-
195).

2. Norms do not indicate that a pupil should be advanced or retarded to
the age or grade group in which the pupil's score is typical. If the height of a
seven-year-old child is typical of the height of six-year-olds, this does not
mean that the child should be six years old rather than seven years old. If a
fourth-grade pupil's achievement level in a certain subject is typical of
sixth-grade pupils, this does not necessarily mean that this pupil should be advanced
to the sixth grade. The high level of performance of this pupil may be due
primarily to superior mastery of fourth-grade material and not to ability to do
sixth-grade work.

3. The problem of obtaining a truly representative sample of a given
population is so great that it seriously hampers the valid use of norms. Test
experts are now more aware than ever of the inadequacy of present methods
in obtaining representative samples for establishing test norms. Consequently,
teachers should be extremely cautious in applying norms to their students.
Teachers should at least closely examine the description of the group on
which the test was standardized to see if it is comparable to the group they are
working with.

4. The teacher should make certain, before applying norms, that the test
is not in any way unfair to the particular group at hand. It has happened that
a certain test tends to discriminate against certain culture groups or contains
items on topics which were omitted by the teacher in class.

5. Although norms are based upon age, grade, and standard scores which
are scale position symbols, norms do not indicate a unit scale of performance
and therefore should not be treated as scale position symbols. At best, norms
are still measures of relative position.

Summary 

In this chapter we have presented some basic descriptive statistical
operations. The significance of the operations is to organize raw measurement data
into more meaningful data, thus to accomplish or further facilitate precise
appraisal of the phenomenon in question. These statistical concepts and
procedures have these essential purposes:

1. To provide an over-all picture of the group pattern

2. To provide a representative score for a group of scores

3. To provide a measure of the variability of a group of scores

4. To provide a measure of relative position for individual raw scores

In connection with the first purpose, visualizing the group pattern of
scores, the following statistical devices were discussed: (a) the arrangement
of the scores in order of size, (b) the establishment of score intervals leading
to the setting up of a frequency table, (c) the construction of a bar graph
called the histogram, and (d) the plotting of the frequency polygon, which
can be used to approximate a smooth frequency curve. The second purpose,
providing a representative score, led to a discussion of the mode, median, and
mean. The measures of variability presented were the range, the
interpercentile range, the interquartile range (which is a special case of the
interpercentile range), the mean deviation, the variance, and the standard
deviation. Finally, as measures of the relative position of an individual in a
group, we examined rank order, percentile rank, and the standard score. Norms
based upon representative groups are a further aid in interpreting the relative
positions of individuals.

For each of the four purposes, the available statistical measures range in
refinement from crude approximations to precise and refined expressions. As
these statistical devices are encountered and used, it is important that we have
in mind their limitations, their special purposes, and the assumptions that
underlie them.


EXERCISES

1.  68  75  78  77  82  80  71  90  86  68
    76  78  73  88  85  72  78  67  80  70
    61  77  71  76  93  76  76  66  7^  76
    86  60  89  73  84  81  84  87  66  81
    81  84  73  64  79  77  88  82  52  72
    75  77  74  82  79  71  75  82  77  50

Using the above raw scores:

a. Construct a frequency table.

b. Construct a histogram and frequency polygon.

c. Compute the mean.

d. Compute the median.

e. Compute the standard deviation.

f. Compute the interquartile range.

g. Draw an ogive.

2. What criteria would you use for judging the suitability of a frequency table?

3. What statistical techniques are appropriate for use with raw scores obtained
from teacher-made tests?

4. Give examples of the misuse of statistics in educational measurement.

5. Examine the manual of a standardized test and summarize the procedures
used in establishing norms for the test. Make a critique of these methods.





CHAPTER 8 


FURTHER STATISTICAL CONCEPTS 
IN MEASUREMENT 


In the preceding chapter, we discussed certain basic statistical concepts
that describe in various ways the measurement data we may have, and we
developed outlines for computing these descriptive devices. In this chapter
we shall consider a few additional statistical concepts often encountered in
educational measurement, which stem from two important statistical terms,
the normal probability curve and the correlation coefficient. The normal
probability curve is often called the "normal curve" and is basic to any
consideration of such topics as sampling, confidence limits, and standard error,
discussed later in this chapter. The correlation coefficient describes the
co-relationship between paired measures, that is, the degree to which one variable
is related to another variable. The correlation coefficient is basic to an
understanding of the technical meaning of reliability and validity as used in
measurement, which also will be discussed later. The computation of these
statistical measures will not be outlined. Only the definitions of the terms will
be developed and some discussion of their application will be made.

Normal Probability Curve 

As its name implies, the normal probability curve is based upon the idea
of probability. Therefore, any definition of this curve must first start with some
basic notions of probability. After having lived awhile, our intuitive
conception of probability has been pretty well formed. Our experience has shown us
that certain situations involve more risk than other situations. The probability
of our being involved in an auto accident on a road with heavy traffic is much
greater if we are driving at a speed of 80 miles per hour than if we are
driving at 20 miles per hour. We also know that the probability of our living
for the next ten years generally decreases with our age.

The above examples, however, are rather complex, so let us confine
ourselves to situations where the probabilities are easily computed. For instance,
suppose we have a container that has in it 30 white balls and 20 black balls;
what is the probability of drawing a black ball? The answer would be,
obviously, 20/50 or 40 per cent. If we threw a single die, what is the probability
of throwing a 3? Since only one of 6 possible ways for a single die to turn up
would be a 3, the probability of throwing a 3 is 1/6 or 16 2/3 per cent.

THE PROBABILITY RATIO 

Probability is generally expressed as a ratio, and we shall formalize the 
process of determining probability as follows. If of n equally likely possibilities 
m of these are favorable to the happening of a certain event, then the prob- 
ability that the event will happen is the ratio m/n. 

Another simple example of probability is that of coin-tossing. Suppose we
have three coins; what is the probability of tossing two heads? To answer this,
we must first consider all the "equally likely" possibilities in which three
coins may fall. They are enumerated as follows: H (heads), T (tails)

     1. H H H         *5. T H H
    *2. H H T          6. T H T
    *3. H T H          7. T T H
     4. H T T          8. T T T

There is a total of eight "equally likely" ways in which three coins may fall
and of these there are three ways (marked by an asterisk) in which the result
will be two heads showing. Then by definition, in the tossing of three coins
the probability of two heads showing is 3/8. Likewise, the probability of getting
three heads is 1/8, of getting one head is 3/8, and of getting no heads is 1/8.
Note that these are expected probabilities. The expected probabilities of the
number of heads appearing for different numbers of coins are shown in
Figure 33.
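The enumeration of equally likely cases can be carried out mechanically for any number of coins. The sketch below is in Python; the function names are our own. The first function counts favorable cases exactly as the definition of the probability ratio m/n requires, and the second obtains the same ratio from the count of combinations, which avoids listing every case when the number of coins is large.

```python
from fractions import Fraction
from itertools import product
from math import comb


def probability_of_heads(num_coins, num_heads):
    """Ratio m/n: favorable cases over all equally likely ways the coins may fall."""
    outcomes = list(product("HT", repeat=num_coins))
    favorable = [o for o in outcomes if o.count("H") == num_heads]
    return Fraction(len(favorable), len(outcomes))


def binomial_probability(num_coins, num_heads):
    """The same ratio obtained by counting combinations instead of listing cases."""
    return Fraction(comb(num_coins, num_heads), 2 ** num_coins)


# Probabilities of 0, 1, 2, and 3 heads when three coins are tossed.
three_coins = [probability_of_heads(3, k) for k in range(4)]
print(three_coins)
```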


    Number of
    coins tossed     Expected probabilities of 0, 1, 2, . . . heads appearing

         4           1/16, 4/16, 6/16, 4/16, 1/16
         6           1/64, 6/64, 15/64, 20/64, 15/64, 6/64, 1/64
         8           1/256, 8/256, 28/256, 56/256, 70/256, 56/256, 28/256, 8/256, 1/256
        16           1/65,536, 16/65,536, . . . , 1,820/65,536, 4,368/65,536,
                     8,008/65,536, 11,440/65,536, . . .

Figure 33. Probabilities in coin tossing.

The figures shown are the frequency polygons for the various numbers of
coins, showing the distribution of the expected probabilities for the different
number of heads appearing.

[Figure: frequency polygons of expected probability plotted against number of heads for the various numbers of coins, followed by the normal probability curve.]


Since the values of the expected probabilities may be computed from a
formula based on the expansion of the binomial $(\frac{1}{2} + \frac{1}{2})^n$,
where $n$ is the number of coins tossed, these frequency polygons are called
binomial distributions.

DEFINITION OF THE NORMAL PROBABILITY CURVE

In the binomial distributions, it is noticed that the shape of the
distribution becomes smoother and more bell-shaped as the number of coins increases.
This observation leads us to a descriptive definition of the normal probability
curve. The normal probability curve is the ultimate shape of the distribution
of expected probabilities of number of heads showing when the number of
coins tossed is increased without limit.

The purpose of this brief development has been to illustrate how the
normal probability curve is related to probability. One can readily see that it
is a very special curve. It is not a curve obtained from measurement data like
the curves obtained in the previous chapter. It is a theoretical distribution,
represented in its standard form by the mathematical equation
$y = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}$. Its usefulness lies in the special
properties it has and in the fact that variations of certain phenomena have been
found to approximate closely the shape of a normal probability curve.
Unfortunately, this curve has been ill-used in educational measurement owing
to misunderstanding of the theoretical nature of the curve, and it is for this
reason that special attention is being given here to the normal probability
curve.

PROPERTIES OF THE NORMAL PROBABILITY CURVE

Since the normal probability curve is a probability distribution, its area
then represents probability. This is its special property, which makes it such
a useful theoretical curve. First, take a brief look at the normal probability
curve in its standard form as shown in Figure 34. We note that the z scale or
standard scores, which were discussed in the previous chapter, are used. Zero
is placed at the mean and the scale goes to the right and left of the mean in
terms of standard deviation units. Thus, on this scale, -2 means two standard
deviations below the mean, and +1 means one standard deviation above the
mean. You will note that three standard deviations above and below the mean
cover pretty much the scope of the curve.

Figure 34. Normal curve and z-scale relationship.

Area-z Score Relationships. With the normal probability curve in
standard form, the portion of area between any pair of z scores is also
standardized. For instance, it has been computed that 68 per cent of the total area
under the curve lies between the z scores +1 and -1, and further that 95
per cent of the total area lies between the z scores +2 and -2. This is
illustrated in Figure 35.




Figure 35. Portions of area of normal distribution contained between
given z scores.




Area percentages between other pairs of z scores can be found by consulting
a table for which these values have been computed.¹
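Where a printed table is not at hand, the same area percentages can be computed directly. The following is a minimal sketch that uses the error function from Python's standard library. The function names `area_below` and `area_between` are ours, not the text's.

```python
from math import erf, sqrt

def area_below(z):
    """Proportion of the area under the standard normal curve to the left of z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def area_between(z_low, z_high):
    """Proportion of the area lying between two z scores."""
    return area_below(z_high) - area_below(z_low)

print(round(area_between(-1, 1), 4))  # 0.6827, about 68 per cent
print(round(area_between(-2, 2), 4))  # 0.9545, about 95 per cent
print(round(area_between(-3, 3), 4))  # 0.9973
```

The last line supports the earlier remark that three standard deviations on either side of the mean cover nearly the whole curve.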

z Score Probability Relationships. The relation between portions of areas 
under the curve and probability can perhaps best be shown by an example. 
Suppose we assume that the distribution of heights of men in the United States 
approximates the normal probability curve. Studies have shown that this is 
not an unreasonable assumption. We shall take as the mean 69 inches (5 ft. 
9 in.) and as the standard deviation 3 inches, which are close to the results 
reported by these studies. Now we are in a position to make certain state- 
ments of probability. 

1. Subtracting three inches (one standard deviation) from, and adding
three inches to, the mean of 69 inches gives us the interval 66 inches (5 ft. 6
in.) to 72 inches (6 ft.), and in this interval, 66-72, lies 68 per cent of the
distribution. Hence, if we choose a man at random, the probability is .68, or
the chances are 68 out of 100, that his height is somewhere between 66
inches (5 ft. 6 in.) and 72 inches (6 ft.).

[Figure: normal curve with z scale (-3 to +3) and height scale (60 to 78 inches); mean = 69 inches, S.D. = 3 inches.]

2. If we subtract and add two standard deviations (6 inches) to the mean,
we get the interval 63 inches to 75 inches, which contains 95 per cent of the
normal probability distribution. If we choose a man at random, the
probability is .95, or 95 chances out of 100, that his height is somewhere
between 63 inches (5 ft. 3 in.) and 75 inches (6 ft. 3 in.).

[Figure: normal curve with z scale and height scale, as above.]

3. Since 16 per cent of the normal probability distribution lies above the
z score +1, we can also make the following statement: The probability
is only .16, or 16 chances out of 100, that a man’s height is over 72 inches
(6 ft.). We might also say that the probability is .84, or 84 chances out
of 100, that a man’s height is less than 72 inches (6 ft.).

[Figure: normal curve with z scale (-3 to +3) and height scale (60 to 78 inches); mean = 69 inches, S.D. = 3 inches.]
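The three statements above can be reproduced with a few lines of code. This is a sketch under the same assumption the text makes, namely that heights follow a normal curve with mean 69 inches and standard deviation 3 inches. The helper names are hypothetical.

```python
from math import erf, sqrt

MEAN_HEIGHT = 69.0  # inches, as assumed in the text
SD_HEIGHT = 3.0     # inches, as assumed in the text

def area_below(z):
    """Proportion of the standard normal curve lying to the left of z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def prob_between(low, high, mean, sd):
    """Probability that a measure falls between low and high."""
    return area_below((high - mean) / sd) - area_below((low - mean) / sd)

def prob_above(value, mean, sd):
    """Probability that a measure exceeds the given value."""
    return 1 - area_below((value - mean) / sd)

print(round(prob_between(66, 72, MEAN_HEIGHT, SD_HEIGHT), 2))  # 0.68
print(round(prob_between(63, 75, MEAN_HEIGHT, SD_HEIGHT), 2))  # 0.95
print(round(prob_above(72, MEAN_HEIGHT, SD_HEIGHT), 2))        # 0.16
```

The same functions apply to any other measure assumed to be normally distributed, once its mean and standard deviation are supplied.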


This same general approach may be applied to other situations involving
the normal probability curve. For instance, studies have shown that the
distribution of IQ’s based on the Stanford-Binet scale approximates the normal
probability curve. The mean IQ has been set at 100 and the standard deviation
is equal to 16.² As shown in Figure 36, we can relate the normal probability
curve to the distribution of IQ’s and make certain probability statements.

¹ See appendix, page 478.

1. The chances are 68 out of 100 that a person’s IQ is somewhere
between 84 and 116.

2. The chances are 16 out of 100 that a person’s IQ is above 116.

[Figure: normal curve with z scale (-3 to +3) and IQ scale (52 to 148); mean = 100, S.D. = 16.]

Figure 36. Probability of a person having a given IQ.

From these few simple examples we can see how probability can be
obtained from the normal probability curve. Likewise, it becomes more obvious
why it is desirable to work with distributions of data that approximate
the curve of normal probability. As a matter of fact, the desirability of using
the normal curve is so strong that it is often used when it should not be. This
leads us to the question, for what types of measures can we reasonably assume
the curve of normal probability to be applicable?

Applicability and Use of the Normal Probability Curve. Clues for
answering the question concerning when the normal probability curve is
applicable may be obtained by referring to the example of tossing coins. This
example was used to explain the theoretical development of the normal curve
earlier in this chapter. There were certain characteristics of this example that
we might point out at this time:

1. There had to be a sufficient number of coins used before the
distribution of number of heads approximated a normal distribution.

2. Each coin was independent of the others. In other words, whether a
certain coin came up heads or tails did not depend in any way upon whether
another coin came up heads or tails.

3. Each coin had equal importance in the final result. No one coin had
more weight in determining the number of heads showing than any other coin.

4. Finally, for each coin there was a 50-50 chance that heads would show.

Such considerations are involved in why the distribution of the number of
heads for tosses of coins resulted in a normal probability distribution. In
general, the normal curve is applicable to distributions of measures that are the
result of several definable factors, each of which is independent, has equal
weight, and for which there is a 50-50 chance of the factor’s being present
or not.

² See Terman and Merrill, Measuring Intelligence, Boston: Houghton Mifflin Co.,
1937, pp. 33-51.

With this background, let us see which human traits are most likely to
result in distributions approximating the curve of normal probability. Because
of the “gene theory,” we would expect a normal probability distribution for
hereditary Mendelian traits in humans, provided there is no selective factor or
special environmental condition affecting the trait. The genes involved for
each of these traits would be analogous to the coins in the previous example.
Figuratively speaking, each person represents a “toss” of the genes with some
genes dominant and others recessive, thus fixing the trait in question for a
particular person. With the “gene theory” the four conditions mentioned
earlier are satisfactorily met and we could expect normal distributions for
such traits as height, head size, nose shape, eye color, length of ear lobes,
and various measures of skeletal structures. It is important to stress again the
fact that no selective factors or special external environmental conditions
should be operating. For example, the distribution of weights of persons
would not likely be a normal probability distribution because their weights
are subject to conscious control.

Much has been said about intelligence being normally distributed.
According to our previous discussion, we would expect a normal probability
distribution only if intelligence is a Mendelian trait, innate and not subject to
voluntary or external control. Getting at innate intelligence has been the chief
stumbling block for intelligence testing and there is still a long way to go (see
Chapter 14). It was mentioned earlier that the distribution of IQ’s for the
Stanford-Binet scale resulted approximately in a normal probability curve.
This was due to a careful selection of items in the test so that the total result
would be a normal curve and not because the test was getting at innate
intelligence. Achievement of pupils in school subjects is also believed to be
normally distributed. By now, it should be obvious that this would be likely
only under the most ideal circumstances, with each pupil working to full
capacity and environmental conditions fully equaled.

Sampling and Error 

The most important applications of the normal probability curve are in
connection with sampling and error. The curve was first encountered by
physical scientists in the form of random errors of measurement and to them
it was known as the “curve of error” (7:166). They observed that when
different persons took measurements of the same object, the results were not
always the same. When several of these measurements were plotted, the
resulting curve approximated what is now known as the normal probability curve.

STANDARD ERROR OF A MEASUREMENT

The standard deviation of this particular normal probability curve or
“curve of error” is given a special name, the standard error. The standard
error is often encountered in educational and psychological measurement;
therefore, let us consider a specific example to show how it is used. Suppose
we have administered a standardized intelligence test to a student and
determined that his IQ is 112. We could stop here and consider this to be his
true IQ. However, we have an uneasy feeling that if we were to administer a
different but equivalent form of the test to him, his IQ would not be exactly
the same.

Now, suppose we had at hand several equivalent forms of the test that
we could use to retest the student many times. What would the distribution of
his IQ’s be? On the basis of the experience of physical scientists, we could
reasonably expect the distribution to approximate the normal probability
curve. Now this is where the standard error comes in. The standard error
would be the standard deviation of this theoretical distribution of IQ’s
obtained by retesting the student several times. If we were to look up the test
manual for this particular intelligence test, we would find the standard error
given. Let us say that the standard error is 4, since this is close to what is
reported for most tests. We are now in a position to interpret the student’s
IQ of 112 with the aid of Figure 37.



[Figure: normal curve centered at 112 on an IQ scale running from 100 to 124.]

Figure 37. Standard error of an IQ illustrated.

1. The chances are 68 out of 100 that the student’s “true IQ” is
somewhere between 108 and 116.

2. The chances are 95 out of 100 that the student’s “true IQ” is
somewhere between 104 and 120.

As indicated in Figure 37, the student’s IQ of 112 is placed at the center
of the probability curve. Subtracting and adding the standard error of 4 units
to the 112 IQ gives us the interval of 108 to 116, which includes 68 per cent
of the area of the probability curve. Similarly, two standard errors above and
below the 112 IQ results in the interval 104 to 120, which encloses 95 per
cent of the area. Both of these intervals are called “confidence intervals,” and
they serve as a basis for estimating true values. For the first interval, 108 to
116, we would be sure 68 per cent of the time that this interval would contain
the true IQ; thus it is called the 68 per cent confidence interval. Likewise, the
second interval, 104 to 120, is called the 95 per cent confidence interval.
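The arithmetic behind these two intervals is simple enough to sketch in code. The function name `confidence_interval` is ours. The observed IQ of 112 and the standard error of 4 are the values used in the example above.

```python
def confidence_interval(observed_score, standard_error, number_of_errors):
    """Interval extending a given number of standard errors on each side of a score."""
    half_width = number_of_errors * standard_error
    return observed_score - half_width, observed_score + half_width

# One standard error on each side: the 68 per cent confidence interval.
print(confidence_interval(112, 4, 1))  # (108, 116)

# Two standard errors on each side: the 95 per cent confidence interval.
print(confidence_interval(112, 4, 2))  # (104, 120)
```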





APPLICATION OF STANDARD ERROR TO INTERPRETATION OF TEST SCORES 

The standard error of a score, then, serves as a basis for establishing
confidence intervals.³ Scores on all standardized tests should be interpreted in this
manner and should never be thought of as fixed scores. In the physical
sciences, where the measurements are much more precise and accurate, the
practice of providing an interval estimate is generally followed. In the test
manuals of many standardized tests this very important standard error of a
score is provided, and it should be employed more often than is generally done.

One place where the standard error of a score is particularly important
is in connection with score profiles. A score profile is a graph of different test
scores for an individual student that are expressed in comparable units.
Suppose a student has been administered the following tests: Vocabulary,
Reading comprehension, Quantitative problems, Spelling, and Social concepts. The
results of these tests can be portrayed by the following score profile.


[Score profile: standard scores from -3 to +3 (horizontal axis) plotted for the tests Vocabulary, Reading Comprehension, Quantitative Problems, Spelling, and Social Concepts.]

In the interpretation of this profile, we are concerned primarily with any
differences to be found in the special abilities of the individual. Thus, the
question of how significant the differences in scores are becomes important.
And if we are to judge from the profile shown, we may tend to overestimate
their significance.

In Figure 38, the confidence intervals provided by the standard errors
of the scores are indicated. The broad bars represent 1 standard error above
and below the observed score, or the 68 per cent confidence interval; and the
lines extend over 2 standard errors, representing the 95 per cent confidence
interval.

³ Standard errors may be computed for the mean and standard deviation also. They
are called respectively the standard error of the mean and the standard error of the
standard deviation. The computation of these standard errors can be found in any statistics
textbook. The interpretation of these standard errors is the same as the standard error of
a score. For instance, the standard error of a sample mean is used to establish confidence
intervals for the true mean.


[Score profile: standard scores from -3 to +3 (horizontal axis) for the tests Vocabulary, Reading Comprehension, Quantitative Problems, Spelling, and Social Concepts, with bars and lines marking the confidence intervals.]

Figure 38. A test profile that includes an indication of standard errors of the scores.


The intervals in the profile shown in Figure 38 give a truer picture of the 
differences between the scores. Now it is apparent that the difference between 
the student’s vocabulary score and his reading score is insignificant; whereas 
the difference between his score on quantitative problems and his spelling 
score is shown to be highly significant. We know this because the confidence 
intervals for vocabulary and reading overlap while those for quantitative 
problems and spelling do not. 
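The overlap rule just described can be sketched as follows. The standard scores and the standard error below are hypothetical values chosen only for illustration, since the actual values plotted in Figure 38 are not given in the text. The function names are ours as well.

```python
def interval(score, standard_error, number_of_errors=2):
    """Confidence interval around a score (2 standard errors: 95 per cent)."""
    half_width = number_of_errors * standard_error
    return score - half_width, score + half_width

def intervals_overlap(first, second):
    """True when two (low, high) intervals share at least one point."""
    return first[0] <= second[1] and second[0] <= first[1]

# Hypothetical standard scores and standard error (not taken from Figure 38).
STANDARD_ERROR = 0.3
profile = {
    "Vocabulary": 0.5,
    "Reading Comprehension": 0.8,
    "Quantitative Problems": 1.5,
    "Spelling": -1.0,
}
bands = {test: interval(score, STANDARD_ERROR) for test, score in profile.items()}

# Overlapping intervals: the difference is treated as insignificant.
print(intervals_overlap(bands["Vocabulary"], bands["Reading Comprehension"]))
# Separated intervals: the difference is treated as significant.
print(intervals_overlap(bands["Quantitative Problems"], bands["Spelling"]))
```

With these assumed values the first comparison overlaps and the second does not, mirroring the pattern the text reads off the profile.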

Drawing such bars on each profile might be too laborious a process where
large numbers of students are involved. But if the bars are not drawn, an
image of these intervals should be held in mind. In educational measurement,
reading an individual score is like reading a meter that has a vibrating pointer.
At best, we can only establish with a certain amount of confidence an interval
which we believe contains the true score.

Correlation 

We turn now to another important statistical concept in educational
measurement, the correlation coefficient. Fundamentally, the correlation coefficient
is used to describe the extent to which the variation of one set of
measurements of a variable is accompanied by the variation of a set of measurements
of another variable or dimension. One of our human activities in which we
are ceaselessly engaged is that of associating the variation of one variable
with the variation of another variable. It is one of our primary methods of
getting along in our environment. At an early age, the child is able to associate
certain facial expressions of his parents with their probable behavior. He is
also able to associate certain noises with certain activities that go on around
the home, and the more associations he makes, the better he is able to
function effectively at home. In education we are constantly studying the
correlation of different variables such as interest, attitudes, achievement, and IQ so
that we may function more effectively as teachers.

Concomitant Variation. In the remaining pages of this chapter, we shall
constantly refer to variations of variables as being “accompanied by” or
“associated with” variations of other variables. This type of variation is often
called “concomitant variation.” We use such phrases deliberately in order to
avoid any possible interpretation of a “cause-and-effect” relation. The fact
that the variation of one variable is accompanied closely by variation of
another variable does not necessarily mean that the first variable is the cause
and the second variable is the effect. The correlation coefficient is not involved
in causal relationships but only with concomitant variation. This will be
re-emphasized more meaningfully later in the chapter.

Degree of Concomitant Variation. Our experience in associating
variables has indicated that some variables are more closely associated than
others. For example, we would suspect that there would be very little
association between a youngster’s school achievement and the number of letters in
his last name. On the other hand, we would suspect that there is a closer
association between a youngster’s school achievement and his IQ. There are
some variables that are almost perfectly associated, such as the time of day
and the position of the sun in the sky. The primary function of the
correlation coefficient is to indicate the degree of closeness of the association or
concomitant variation of two variables or dimensions.

Direct or Positive Concomitant Variation. The other function of the
correlation coefficient is to indicate whether the concomitant variation of two
variables is direct (positive) or inverse (negative). A direct or positive
concomitant variation or correlation is the case whenever an increase in one
variable is accompanied by an increase in the other variable and, likewise,
whenever a decrease in the first variable is accompanied by a decrease in the
second variable. An example of this type of association is depth and water
pressure, because an increase in depth of water is accompanied by an
increase in water pressure. Other examples of variables that have positive
correlation are height and weight, IQ and school achievement, ability in
mathematics and achievement in science courses, vocabulary and reading
comprehension.

Inverse or Negative Concomitant Variation. Other variables are
inversely associated or have a negative correlation. In this case, an increase in
one variable is accompanied by a decrease in the other variable, or a decrease
in one is accompanied by an increase in the other. An example of this type
of variation is altitude and atmospheric pressure, an increase in altitude being
accompanied by a decrease in atmospheric pressure. Likewise, in flying
nonstop from Chicago to San Francisco, an increase in speed is accompanied by
a decrease in the time it takes; thus, in this case, there is a negative
correlation between speed and time. Other pairs of variables having negative
correlation are dominance and submissiveness, TV reception and distance from
transmitter, ability to memorize digits or nonsense syllables and age after
twenty years.

The Correlation Coefficient, r. On the basis of the above discussion, the
correlation coefficient must indicate the degree of closeness of the correlation
between two variables and the direction of correlation, whether it is direct or
inverse. Generally, the letter r is used to denote the correlation coefficient and
the degree of closeness is indicated by a decimal scale ranging from 0 to 1,
with 0 indicating no concomitant variation whatsoever and 1 indicating perfect
concomitance. Intervening decimals such as .23, .57, .82 would indicate
varying degrees of concomitance, with r = .82 indicating a closer association
than r = .57. The direction of the correlation is indicated simply by a + or
- sign, where + indicates a positive or direct correlation and - indicates
a negative or inverse correlation. An r = -.82 would still indicate a closer
association than an r = +.57, but the directions of the corresponding
variation would be different; r = -1.00 would indicate a perfect inverse
concomitant variation.

The Scatter Diagram. Before we develop a definition of the correlation 
coefficient, we need a visual portrayal of the relationship between two vari- 
ables. This is provided by the scatter diagram. 

The construction of a scatter diagram will be discussed in detail. However, 
the primary purpose here is to provide a figurative development of the nature 
of correlation and concomitant variation rather than to enable one to compute 
the correlation coefficient. 

In order to set up a scatter diagram for two variables, we need first to
have paired measures of the two variables. Suppose, for example, we wish
to develop a scatter diagram for vocabulary and reading comprehension. We
need not only measures of vocabulary and reading comprehension, but we also
need to have these measures paired on the basis of individual students where
each student has a vocabulary score and a reading comprehension score. The
next step is to set up a vertical axis and a horizontal axis. On the horizontal
axis we lay off a scale for one of the variables and on the vertical axis, a scale
for the other variable. Sometimes it is desirable to have these scales laid off
in terms of score intervals (see pages 130-133). The vertical and horizontal
scales are used then in plotting the pairs of measures and the resulting scatter
of points is called a scatter diagram. The steps in setting up a scatter diagram
are summarized in the accompanying outline.





Outline for Setting up a Scatter Diagram for Two Variables 

1. Begin with pairs of measurements for the two variables.

    Student    Score on Variable X    Score on Variable Y
    James              70                     50
    Mary               50                     30
    George             40                     60
    Susan              30                     40
    John               20                     30


2. Draw a horizontal axis and a vertical axis.

3. On the horizontal axis lay off a scale for variable X and on the vertical
axis lay off a scale for variable Y.

4. Plot each pair of scores as a point on the diagram. For example, for the
first pair, go out to 70 on the horizontal scale, up to 50 on the vertical scale, and
plot the point. Each point, therefore, represents a pair of scores for a particular
student.
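The steps of the outline can be sketched in code using the five pairs of scores listed in step 1. The coarse text plot below is only one possible way of drawing the diagram. The function name, the scale limits of 0 to 80, and the step of 10 are our assumptions.

```python
# Paired scores (X, Y) for each student, taken from step 1 of the outline.
scores = {
    "James": (70, 50),
    "Mary": (50, 30),
    "George": (40, 60),
    "Susan": (30, 40),
    "John": (20, 30),
}

def scatter_rows(points, low=0, high=80, step=10):
    """Return the text rows of a coarse scatter diagram, highest Y first.

    Points are assumed to fall exactly on the grid (multiples of `step`).
    """
    rows = []
    for y in range(high, low - 1, -step):
        cells = ["*" if (x, y) in points else "." for x in range(low, high + 1, step)]
        rows.append(f"{y:>3} " + " ".join(cells))
    # Horizontal scale, shown in tens to keep the columns aligned.
    rows.append("    " + " ".join(str(x // step) for x in range(low, high + 1, step)))
    return rows

points = set(scores.values())
print("\n".join(scatter_rows(points)))
```

Each asterisk stands for one student's pair of scores, with X read along the bottom row (in tens) and Y read down the left side.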


The scatter diagram provides us with a means of observing the con- 
comitant variation of two variables. A few special examples are shown in 
Figure 39. 

These scatter diagrams illustrate a few of the many possible ways in which 
two variables may be related. We shall be particularly interested in the last of 
these illustrations, in which some degree of concomitant variation has been 
indicated but not perfect correlation. Most of the correlations to be found 
between various psychological and educational variables are of this type. 

DEFINITION OF THE CORRELATION COEFFICIENT

Our concern now will be to show how the correlation coefficient indicates
the degree of closeness of concomitant variation between two variables.⁴ We
shall present here a definition that will be helpful in understanding the nature
of a correlation coefficient rather than in computing the value of the
correlation coefficient.⁵

⁴ To use a correlation coefficient as an indicator of the degree of closeness of
concomitant variation between two variables, it is necessary to assume that their true
scatter diagram is in the form of a two-way normal distribution.

⁵ Methods for computing the correlation coefficient are to be found in any elementary
statistics text. See bibliography at end of the chapter.





If there is perfect positive correlation between variables X and Y, the points will all fall
on a straight line. [Scatter diagram.]

If there is perfect negative correlation, the points will all fall on a straight line.
[Scatter diagram.]

If there is no correlation between variables X and Y, then the points of the scatter
diagram will show no trend. [Scatter diagram.]

If there is some degree of positive correlation, then the scatter of points will indicate a
general trend. [Scatter diagram.]

Figure 39. Illustrative scattergrams for different degrees of correlation.




Variation Around a Line. Our definition of the correlation coefficient is
based not only upon the portrayal of concomitant variation by the scatter
diagram, but also upon the idea of variation of the points in the scatter
diagram around certain key lines. One of these key lines is a horizontal line
drawn through the mean of the Y variable (Ȳ) in the scatter diagram, as
shown.

[Figure: the horizontal mean line drawn through the mean of the Y variable (Ȳ) of a scatter diagram.]


Now, suppose we draw the scatter diagram of the X and Y variables again
and this time show the points in relation to the horizontal mean line. How
can we determine the variation of the points around this line?

The variation of the points around the
horizontal mean line is based upon the distance
of each point from the line, as indicated in the
figure by vertical lines drawn from each point
to the line. These distances are used in
computing total variance.


If we square the vertical distance of each point from the horizontal mean
line, add these squares, and divide by the number of points, the result is the
variance (see Chapter 7) of the points around the horizontal mean line. This
quantity represents the total variation of the Y variable. As explained in
Chapter 7, if we take the square root of this variance, the result by definition
is the standard deviation of the Y variable. We shall use variance here as an
indicator of variation rather than the standard deviation because variances,
in this case, can be added or subtracted.
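As a small worked illustration of this computation, the sketch below uses the five Y scores from the outline for setting up a scatter diagram (50, 30, 60, 40, 30). Treating those scores as the data here is our choice, and the function name is hypothetical.

```python
from math import sqrt

# Y scores of the five students in the scatter-diagram outline.
y_scores = [50, 30, 60, 40, 30]

def variance_around(values, line_height):
    """Mean of the squared vertical distances from a horizontal line."""
    return sum((value - line_height) ** 2 for value in values) / len(values)

mean_y = sum(y_scores) / len(y_scores)               # height of the mean line
total_variance = variance_around(y_scores, mean_y)   # total variance of Y
standard_deviation = sqrt(total_variance)

print(mean_y, total_variance, round(standard_deviation, 2))  # 42.0 136.0 11.66
```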

Taking the same scatter of points, we notice that the scatter of points
seems to have a trend as indicated by the straight line in the accompanying
figure. This trend line (also called the regression line) represents the variation
of the Y variable with respect to the X variable and the exact position of this
line can be calculated.⁶ Here, we shall content ourselves, however, with a line
that approximates the general trend of the points, based upon our eye and
judgment.

[Figure: scatter diagram with trend line.]

⁶ In most elementary statistics texts, the “least squares” method for obtaining these
lines is outlined or at least the necessary formulas are given.






The variation of the points around the trend
line is based upon the distance of each point from
the line as indicated in the figure by vertical lines
drawn from each point to the line. These distances
are used in computing unexplained or error variance.


Again, if we square the vertical distance of each point from the trend line, 
add these squares, and divide by the number of points, the result is the vari- 
ance of the points around the trend line. We shall call this variance the un- 
explained variance or error variance since we are not able to account for, or 
give reasons why, the points do not closely correspond to the trend line. If we 
take the square root of the unexplained or error variance, the result is a 
standard deviation which is given a special name, “standard error of estimate.” 

If there is some degree of correlation or concomitant variation between 
the two variables under consideration, then the variance of the points around 
the trend line in the scatter diagram should be less than the total variation of 
the points around the horizontal mean line. In fact, the trend line by definition 
is that line that reduces the variance of the points to a minimum. In other 
words, the trend line is the best we can do to explain or account for the varia- 
tion of the Y variable with respect to the X variable. This leads us to what 
is meant by “explained variance.” 

Explained variance = Total variance minus unexplained variance

This simply tells us that the explained variance is equal to the difference
between the total variance around the horizontal mean line and the
unexplained variance around the trend line.

The Correlation Coefficient as a Ratio. The correlation coefficient is not
different from any other coefficient, in that it is basically a ratio. The square
of the correlation coefficient is equal to the ratio of the explained variance to
the total variance.

$$ r^2 = \frac{\text{explained variance}}{\text{total variance}} $$


By taking the square root of both sides of the above equation, we can express
the definition in another form.

$$ r = \sqrt{\frac{\text{explained variance}}{\text{total variance}}} $$
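The definition can be traced through numerically. The sketch below takes the five pairs of scores from the scatter-diagram outline, fits the trend line by the least-squares method (our assumption, where the text settles for a line drawn by eye), and then forms the ratio of explained to total variance. All variable names are ours.

```python
from math import sqrt

# Paired scores from the scatter-diagram outline.
x_scores = [70, 50, 40, 30, 20]
y_scores = [50, 30, 60, 40, 30]
n = len(x_scores)

mean_x = sum(x_scores) / n
mean_y = sum(y_scores) / n

# Least-squares trend line (assumed method): y = intercept + slope * x.
sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_scores, y_scores))
sum_xx = sum((x - mean_x) ** 2 for x in x_scores)
slope = sum_xy / sum_xx
intercept = mean_y - slope * mean_x

# Variance around the horizontal mean line and around the trend line.
total_variance = sum((y - mean_y) ** 2 for y in y_scores) / n
unexplained_variance = sum(
    (y - (intercept + slope * x)) ** 2 for x, y in zip(x_scores, y_scores)
) / n
explained_variance = total_variance - unexplained_variance

r_squared = explained_variance / total_variance
r = sqrt(r_squared) if slope >= 0 else -sqrt(r_squared)  # sign follows the trend
standard_error_of_estimate = sqrt(unexplained_variance)

print(round(r_squared, 3), round(r, 3))  # 0.143 0.379
```

For these five students only about 14 per cent of the variance of Y is explained, so the correlation is low even though the trend is positive.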

With this definition of the correlation coefficient, we can show how the 
degree of closeness of concomitant variation is indicated by considering the 
two extreme cases of perfect correlation and of no correlation. First we take 
the case of perfect correlation. 



[Figure (a): scatter diagram with horizontal mean line, showing total variance. Figure (b): the same points with the trend line drawn; unexplained variance = 0.]


Perfect Correlation. Figure a presents a hypothetical scatter diagram
for two variables, X and Y. The total variation of the Y variable is indicated
by the distances of the points above and below the horizontal mean line shown
in the figure. In Figure b, the trend line or regression line is drawn and all the
points fall on this line. This would indicate that there is no unexplained or
error variance. Thus, the explained variance and total variance are equivalent
and their ratio is equal to 1. In the case of perfect positive correlation, r is
equal to 1.

No Correlation. We turn now to a hypothetical case in which there is
no correlation between the two variables. The scatter diagram is provided in
the accompanying figure and the horizontal mean line is shown.



In this scatter of points, no trend is indicated. In fact, the trend line and
the horizontal mean line are one and the same line. Therefore, the total
variance of the Y variable is completely unexplained. In other words, the total
variance is equal to the unexplained variance, and the explained variance is
equal to zero. With the numerator in the ratio being zero, the correlation
coefficient (r) then is equal to zero.

Importance of the Trend Line. We can see now how important the trend
line is in analyzing the correlation coefficient. The trend line serves as a basis
for dividing the total variance into explained variance and unexplained
variance. The correlation coefficient is directly related to the explained variance.
When the explained variance is zero, then r = 0, and when the explained
variance approaches the value of the total variance, then r approaches 1. This,
then, gives us a rough graphic picture of how the correlation coefficient
indicates the varying amount of concomitant variation for two variables. The
trend line does not necessarily have to be a straight line. In some cases, the
scatter of points may indicate a parabola or some other curve. The definition
of the correlation coefficient, however, remains the same.





INTERPRETATION OF THE CORRELATION COEFFICIENT 

In interpreting the correlation coefficient, the definition we have presented 
can be used to good advantage. Let us suppose that a certain study reports 
that the correlation between two variables is .80. Using the definition, we find 
that 


$$ r = .80 = \sqrt{\frac{\text{explained variance}}{\text{total variance}}} $$

By squaring the correlation coefficient r = .80, we find that the ratio of
explained variance to the total variance is .64. In other words, 64 per cent of
the variance of the Y variable is accounted for by its concomitant variation
with the X variable. Likewise, if r = .40, then r² = .16 and 16 per cent of the
variance of the Y variable can be attributed to its covariation with the X
variable. On the basis of being able to explain variance, we can say that an
r = .80 indicates a degree of correlation 4 times as great as an r = .40.
Furthermore, the definition can be used to show that the difference in correlation
between r = .80 and r = .90 is much greater than the difference in correlation
between r = .20 and r = .30. Therefore, the increase in the amount of
covariation as indicated by the correlation coefficient is not evenly spread out
from r = 0 to r = 1. Nor does r = .50 indicate a halfway point in
concomitant variation, since it represents only 25 per cent covariance.*
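
The arithmetic of this paragraph can be checked with a short computation. The
sketch below, in Python, simply squares each coefficient; the function name
`explained_share` is ours, and the particular list of coefficients is chosen
only to match the values discussed.

```python
def explained_share(r):
    """Proportion of the variance of Y accounted for, taken as r squared."""
    return r ** 2


for r in (0.20, 0.30, 0.40, 0.50, 0.80, 0.90):
    print(f"r = {r:.2f}   explained share = {explained_share(r):.2f}")

# An r of .80 explains four times the variance that an r of .40 explains.
ratio = explained_share(0.80) / explained_share(0.40)

# Equal steps in r are not equal steps in explained variance.
step_high = explained_share(0.90) - explained_share(0.80)  # about .17
step_low = explained_share(0.30) - explained_share(0.20)   # about .05
print(round(ratio, 2), round(step_high, 2), round(step_low, 2))
```

The step from .80 to .90 adds about 17 per cent of explained variance, while
the step from .20 to .30 adds only about 5 per cent.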

Correlation Does Not Mean Cause. A final word needs to be said about
the fact that a high correlation between two variables does not necessarily
mean that variation in one variable will cause a certain amount of variation in
the other variable. From our development of the correlation coefficient, it
should be apparent now that covariation is the sole concern. Whenever we
find correlation existing between two variables, all we can say is that the
variation of one variable “is accompanied by” a certain amount of variation
in the other variable, depending on how high is the correlation coefficient. It
is quite possible that the correlation found between two variables is entirely
coincidental. Studies have shown, for instance, that there is a positive
correlation between the number of inmates in insane asylums and enrollment in
colleges. Also, a positive correlation between the salaries of Presbyterian
ministers in Massachusetts and the price of rum in Havana was found over a
ten-year period.† In these instances we would strongly suspect that the
correlation was purely coincidental and we would not be likely to attach any
significant causal relationship to the results.

* Often, instead of comparing ratios of variances, a comparison of standard
deviations is used in interpreting the correlation coefficient. Specifically, the
standard error of estimate based on the trend line is compared to the standard
deviation of the Y variable based on the horizontal mean line. The ratio of
these two standard deviations is called an index of predictive efficiency for
the correlation coefficient. Tables and curves are provided in most elementary
statistics texts. However, the essential characteristics of the correlation
coefficient, which are mentioned here, hold true whether we compare variance
or standard deviations.

It is also possible that the correlation found between two variables was 
due to their both being related to a common factor. There is generally a high 
correlation between vocabulary and reading comprehension, but we cannot 
say which is the cause of the other. In fact, both of these variables may be due 
to the cultural level of the home environment. The correlation coefficient, 
then, should not be used to establish a cause-effect relationship between two
variables. 

Reliability Estimates 

In Chapter 3, reliability was mentioned as one of the essential
characteristics of a good measuring procedure. Briefly, in connection with reliability
such questions are raised as: How accurate is the measuring procedure? How
reliable? How precise? How consistent? How trustworthy? How much
confidence can we place in the result? With an understanding of the normal
probability curve and the correlation coefficient, we can now investigate more
technically and analytically the meaning of reliability, and the types of data
and statistical procedures necessary for estimating the reliability of a
measuring procedure.

For our purposes, to say that reliability is synonymous with dependability,
accuracy, and consistency is insufficient. We need an operational definition,
a definition that provides us with a method for determining reliability. In
casting about for an operational definition of reliability, we might ask
ourselves the question: How would we ordinarily check the reliability of a
person? We could take a statement or observation made by the person and
have the information checked independently by other observers. If there was
close agreement between the original statement of the person and the report
of the observers, then we would have evidence that the person is reliable. In
science, the reliability of an experiment is checked in much the same fashion.
The experiment is conducted independently by several scientists and the
results are checked for closeness of agreement with the original experiment.
Likewise, a machine is said to be reliable when its successive performances
are observed to be approximately identical.

From these few examples just presented, we can see that the process of
determining the reliability of an observation or procedure is essentially a
matter of comparing one or more independent observations or performances
with the original observation or performance and checking for closeness of
agreement. This concept of reliability provides us now with an operational
basis for determining the reliability of an educational measuring procedure.
We simply obtain independent observations of the results of the measuring
procedure and then determine how closely these observations agree. The
problem of determining how closely the observations correspond is essentially a
statistical problem and will be discussed after an analysis has been made of
the possible sources of variation among observations.

† See Harold Larrabee, Reliable Knowledge (Boston: Houghton Mifflin Co., 1945).

In educational measurement, we are concerned primarily with human
responses and, therefore, the sources of inaccuracies of measurement are 
necessarily complex. It is doubtful, then, if one can develop an all-inclusive 
list of sources of variation in educational measurement. However, the follow- 
ing are among the more important ones. 

POSSIBLE SOURCES OF VARIATION IN HUMAN RESPONSES TO
EDUCATIONAL MEASURING PROCEDURES

1. Sampling Variation. In educational measurement, the measuring
procedure usually consists of a sample of questions, tasks, or observations taken
from an exceedingly large number of possible questions, tasks, or
observations. Human responses vary with different samplings of items and, hence,
this is a source of unreliability in a measuring procedure.

2. Psychological and Physiological Variation. While an individual is
undergoing measurement, he is subject to certain physiological and psychological
determinants that may cause variations in his responses. Some of these factors
are health, fatigue, motivation, tenseness, distraction proneness, tendency
toward guessing, and memory fluctuations. Also included in this category
are such environmental factors as heating, lighting, noise, interruptions, and
seating facilities, which may affect the individual psychologically and
physically.

3. Variation in the Technical Aspects of the Measuring Procedure. A
third source of inaccuracy in educational measurement is in the quality of the
measuring instrument itself. This category includes variation in the adequacy
of directions, in clarity of test items, in typographic quality, and in the scoring
or rating procedure.

These sources of inaccuracies all have one common characteristic. The
nature of their influence in any given measurement is somewhat a function of
chance. During any measurement procedure, there is some chance of a certain
factor influencing the result unfavorably but perhaps an equal chance that it
will increase a measure or improve a score. There is also the possibility that
it will not affect the measurement at all. Consequently, as we have observed
earlier (Chapters 3, 4, 5, and 6), increasing the length of a measuring
procedure or repeated measurement will permit chance to operate as much to
increase scores as to depress them and thus increases the reliability of
measurement.

It should be noted that psychological and physiological factors are likely
to fluctuate at different rates. Some of these factors are relatively constant
during a period of measurement and even for several days or weeks, for
example, health, motivation and proneness to distraction. Others, such as
fatigue, noise and interruptions, are operative perhaps for only a minute or at
most for the period of the given measurement. These varying rates of
fluctuation tend to complicate the matter of repeated measurement. If measurement
is reapplied after a short interval, the factors which fluctuate rapidly may be
taken care of but those which fluctuate slowly may have remained unchanged
and may continue to affect the measurement in the same way. If done after
a long enough interval for these long-term factors to fluctuate normally, then
the individuals being measured may have undergone substantial changes with
regard to whatever is being measured.

DETERMINING THE RELIABILITY OF MEASURING INSTRUMENTS‡

We turn now to the procedures used to determine the reliability of
measuring instruments employed in education. According to our previous statements
about reliability, we should have several independent observations of the
results of the given instrument in order to check for closeness of agreement.
In measuring human behavior, though, we know that the individual may be
changed by the measuring process. For example, in taking a test, a student
may benefit by the practice and thereby increase in “test-wiseness” and even
in knowledge of the material. So it is necessary to keep the number of
repeated measurements to a minimum and even to gauge reliability in other ways
than through repeated measurement.

The three procedures generally used for determining the reliability of
educational measuring devices and processes are:

1. Retest: repetition of the same measuring procedure.

2. Subdivided Test: where for purposes of scoring a test is subdivided
into two or more parts for comparison.

3. Equivalent Test: where a parallel test is developed and administered
for comparison with the original test.

Retest Procedures. Reapplication of the measuring device or procedure
would seem to be the most straightforward approach to checking its reliability.
For instance, if we wish to check the reliability of an instrument that
measures a person’s strength of grip, we would simply take repeated measurements
for the same individuals and check the results for consistency. The same
procedure would be applicable to all instruments measuring such motor
abilities and physical characteristics as vision, hearing, weight, ball-throwing,
and reaction speed. However, when we consider tests that measure
psychological characteristics, we have noted that this procedure of repeated measurement
runs into difficulty. In addition to the practice effects already mentioned, the
individual’s second approach to the same test will differ from his first approach
since the novelty has worn off, the directions are already understood, and only
a cursory glance is needed for the reading passages and questions. Another
limitation of the retest procedure is the fact that there is no provision for
varied sampling. The test items remain the same for each performance. Chance
fluctuations in physiological and psychological conditions may be provided
for by this method if the interval between tests is properly chosen. Moreover,
the method is well adapted to appraising variations in technical aspects of
the test.

‡ While reliability estimates usually are confined to tests and the discussion
here is in terms of tests, the procedures described are applicable with
appropriate modifications to observation techniques, to rating scales and to the
scoring of products as well.

Subdivided Test Procedures. The second procedure for determining
reliability, the subdivided test, represents an attempt to minimize the practice
effect encountered in the retest method. Generally a test is administered to a
group and then it is subdivided in some fashion for comparison of results. The
test, after it has been administered, usually is split into equivalent halves by
putting the odd-numbered questions into one of the halves and the
even-numbered questions into the other. The two halves are then scored separately
and are compared for consistency. This procedure is generally called the
“split-halves” method for determining reliability and it is the one most widely
used because of the convenience of providing an estimate of reliability from
a single administration of the test.

Although the “split-halves” method reduces the practice effect to a
minimum, it does have some important limitations. In the first place, this
procedure does not provide for the day-to-day or week-to-week fluctuations of
human physiological and psychological factors. These factors may not have
an opportunity to vary during a single administration of a test. Unreliability
in the technical aspects of a test and sampling variation are adequately
appraised by this procedure only if the test is sufficiently long.
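
The split-halves comparison can be sketched as follows. The sketch, in Python,
scores the odd-numbered and even-numbered items separately and correlates the
two half scores. The item responses (1 for a correct answer, 0 for an incorrect
one) are invented, the helper names are ours, and the use of the product-moment
formula for the correlation is an assumption. The coefficient obtained describes
the agreement of the two halves only; no adjustment for test length is attempted.

```python
import math


def pearson_r(xs, ys):
    """Product-moment correlation between two lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def half_scores(item_scores):
    """Score the odd-numbered and the even-numbered items separately."""
    odd = sum(item_scores[0::2])    # items 1, 3, 5, ...
    even = sum(item_scores[1::2])   # items 2, 4, 6, ...
    return odd, even


# Hypothetical responses of six pupils to an eight-item test.
responses = [
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
]

odd_half, even_half = zip(*(half_scores(pupil) for pupil in responses))
r_halves = pearson_r(odd_half, even_half)
print("odd-half scores: ", odd_half)
print("even-half scores:", even_half)
print("correlation between halves:", round(r_halves, 2))
```

With so short a test the two halves contain only four items each, which
illustrates why the procedure is trustworthy only when the test is
sufficiently long.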

Equivalent Forms Procedures. The third procedure for determining the
reliability of a test is to develop another test according to the same
specifications but with different questions, and then to compare the results of the two
equivalent tests. If there is a time lag of a few weeks in the administration of
the two parallel tests, then each of the three categories of variation previously
mentioned has an opportunity to affect the testing situation and practice effect
is minimized. This equivalent test procedure then provides the most severe
estimate of reliability and, for this reason, the procedure is preferable to the
other two for estimating the reliability of tests. There are, however, certain
practical limitations to this procedure. The problem of constructing an
equivalent test is a great one and the administration of a second separate test
to the same individuals may place an unreasonable demand upon the
individuals. The short cuts offered by the other two procedures are tempting, but
they are taken at the expense of accuracy in estimating the true reliability of
the test.

STATISTICAL DETERMINATIONS OF RELIABILITY

Having considered the various factors involved and the methods for
determining the reliability of educational measuring procedures, we are ready now
to outline briefly the statistics used in determining the degree of reliability.
As was pointed out in the definition of reliability earlier in this section, the
problem of determining how closely successive observations of a measuring
procedure correspond is essentially a statistical problem.

Reliability Coefficient. There are essentially two ways of expressing
statistically the reliability of an educational measuring procedure. One way is
to indicate how well each individual’s score on the first testing or on the one
half of the test corresponds to his score on the second testing or the other half.
The degree of correspondence between the two results may be expressed
statistically as a correlation coefficient. A high correlation coefficient indicates
a close correspondence between the individual’s scores in the two
measurements and a high degree of reliability for the measuring procedure. This
correlation coefficient is more generally known as the reliability coefficient
and is designated by the symbol r₁₁ (read as “r sub one-one”).

Standard Error of Measurement. The second method of expressing
reliability statistically is to indicate the amount of variation in the repeated
measures of an individual. This is done by using the standard error concept
discussed earlier in this chapter. Owing to the somewhat random nature of the
effect of the error factors involved, the repeated measurements of an individual
tend to form a normal probability curve. The amount of variation for these
repeated measures is given by the standard deviation of this probability curve
and is known as the standard error of measurement. In educational
measurement, we seldom obtain more than two observations of a measuring procedure
for each individual, but from these two observations it is statistically possible
to estimate the variation of a larger number of repeated observations. Thus,
a smaller standard error of measurement indicates a higher degree of reliability
for the measuring procedure.

The two statistical measures of reliability, namely, the reliability coefficient
and the standard error of measurement, are related in the following fashion:

$$ SE = S\sqrt{1 - r_{11}} $$

where
    SE = standard error of measurement
    S = standard deviation for the test
    r₁₁ = reliability coefficient
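
The relation between the two measures can be tried out numerically. The sketch
below, in Python, applies the formula to a hypothetical test; the standard
deviation of 15 and the function name are assumptions made for illustration,
and the reliability coefficient of .89 is borrowed from the example discussed
under the interpretation of reliability data.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """SE = S * sqrt(1 - r11), the relation between the two measures."""
    return sd * math.sqrt(1.0 - reliability)


# Hypothetical test: standard deviation of 15, reliability coefficient of .89.
se = standard_error_of_measurement(15, 0.89)
print(round(se, 1))

# The limiting cases: a perfectly reliable test has no error of measurement,
# and a test with zero reliability has an error as large as the test's
# standard deviation.
print(standard_error_of_measurement(15, 1.0))
print(standard_error_of_measurement(15, 0.0))
```

Under these assumed figures the standard error of measurement is about 5 score
points, roughly a third of the standard deviation of the test.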

INTERPRETING RELIABILITY DATA

We might now briefly consider the general problem of interpreting
reliability data. For instance, suppose it is reported that a certain standardized
test has a reliability coefficient of .89. How shall we interpret it? One way is
to find out and to compare the reliability coefficients for other similar
standardized tests. This procedure makes the interpretation of reliability entirely a
relative matter. Another way is to determine the standard error of
measurement for the test and then decide whether or not the results of the test are
sufficiently accurate for the desired purpose.

In interpreting reliability data and in comparing the reliability of different 
tests, certain general considerations should be kept in mind. 

1. How accurate results are needed? If the purpose for using the test is
to determine the relative standing of an individual in a fairly homogeneous
group, then a very reliable test is necessary. On the other hand, if the purpose
is to determine the general grade level of achievement of the individual, then
a less reliable test may be used.

2. What is the range of the group being measured? In general, the test
measuring a group that presents a greater range in the trait being measured
will have a higher reliability coefficient than a test measuring the same trait
for a more homogeneous group. This fact should be remembered when
comparing the reliability of two tests.

3. Is the test too easy or too difficult for the group being measured? In
general, for a test that is too difficult, an undue amount of guessing may result,
thereby lowering the reliability of the test. When a test is too easy for a group,
it will fail to discriminate among the members of the group and again its
reliability will be lowered.

4. What is the nature of the measurement symbol resulting from the test?
The fineness or grossness of the scaling, ranking, or classifying symbol has an
effect on the reliability of the test. In general, the measuring procedures using
scale symbols are more reliable than measures using classification symbols.
Essay tests that use classifying symbols generally have reliability difficulties.

Validity Estimates 

As previously defined, validity has to do with aim: the capacity of a
measuring procedure to measure what it purports to measure. There are
essentially two approaches to validity. One is called the logical or rational approach,
in which the validity of the test is simply analyzed in terms of its objectives,
the character of its items, and its general format. This approach has been
amply discussed in Chapter 3.

The approach we are concerned about here is called the empirical or
statistical approach to validity. A specific example probably will show best how
a test is validated by this method. Suppose, for instance, we wish to check
statistically the validity of a mechanical comprehension test. We would first
select a typical group of persons for which the test is intended and administer
this test to them. Then we would have each individual in the group observed
and rated by a group of experts in actual situations that demand varying
levels of mechanical comprehension. A correlation coefficient is computed
between the test scores and the ratings of the experts and a high correlation
would be construed to indicate that the test has high validity. The ratings of
the experts are called the criterion and the correlation coefficient computed for
the test is called the validity coefficient.
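
The computation of a validity coefficient along these lines can be sketched as
follows. The sketch, in Python, assumes that the ratings of the several experts
are averaged to form the criterion and that the coefficient is a product-moment
correlation; the scores, the ratings, and the function name are invented for
illustration.

```python
import math


def pearson_r(xs, ys):
    """Product-moment correlation between two lists of measures."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical mechanical comprehension scores for five persons.
test_scores = [52, 61, 70, 78, 85]

# Hypothetical ratings (1 = low, 5 = high) of the same persons by three experts.
expert_ratings = [
    [2, 3, 2],
    [3, 3, 4],
    [3, 4, 4],
    [4, 5, 4],
    [5, 5, 4],
]

# The criterion: each person's average rating across the experts.
criterion = [sum(ratings) / len(ratings) for ratings in expert_ratings]

validity_coefficient = pearson_r(test_scores, criterion)
print("validity coefficient:", round(validity_coefficient, 2))
```

A coefficient this high would be read as evidence of high validity, provided
the experts' ratings are themselves a reliable and unbiased criterion.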

As stated above, the empirical approach to validity first requires the
establishment of a criterion upon which the test is validated. This criterion should
provide a reliable measure and be as free from bias as possible. For some kinds
of tests it has not been possible to establish such a criterion. Achievement tests
are an example. For aptitude tests or tests used for prediction, it has been
generally possible to establish a criterion in terms of later performance. Some
tests are validated with reference to a similar test for which validity is assumed.
For example, some intelligence tests have been validated on the basis of their
correlation coefficients with the Stanford-Binet intelligence test. Since the
setting up of a suitable criterion presents a difficult problem, the statistical
approach to validity is somewhat limited.

Summary 

The normal probability curve, the basis for many analyses and
interpretations of measures, is a theoretical distribution of expected probabilities for any
given occurrence when chance alone determines occurrence or nonoccurrence
and the opportunities for occurrence are unlimited. The special property of
the curve making it useful in educational measurement is the predictable
relationship between distances along the base line from the mean of the
distribution and areas under the curve. The latter may be interpreted as given
proportions of the distribution. The curve, then, provides a standard way of
estimating the operation of chance in behavioral measurement and for
predicting the distribution of genetically determined human traits or dimensions.

The standard error of measurement is the standard deviation of the normal
distribution of any measure that is repeated an unlimited number of times.
Thus it is an index of scatter or chance variation in the measure. If the
standard error of a pupil’s score on an achievement test were 5 and his score were
80, the odds are 2 to 1 that his true score lies someplace between 75 and 85.
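
The odds quoted for the pupil’s score can be checked against the normal curve.
The sketch below, in Python, uses the standard library’s `statistics.NormalDist`
(available from Python 3.8); the function names are ours, and the band of two
standard errors is added only for comparison.

```python
from statistics import NormalDist


def score_band(score, standard_error, width=1.0):
    """Limits lying `width` standard errors on either side of a score."""
    margin = width * standard_error
    return score - margin, score + margin


def chance_within(width):
    """Proportion of a normal distribution lying within `width` standard
    deviations of its mean."""
    unit_normal = NormalDist()  # mean 0, standard deviation 1
    return unit_normal.cdf(width) - unit_normal.cdf(-width)


low, high = score_band(80, 5)
inside = chance_within(1.0)
odds = inside / (1.0 - inside)
print(f"band {low} to {high}: chance {inside:.2f}, odds about {odds:.1f} to 1")

low2, high2 = score_band(80, 5, width=2.0)
print(f"band {low2} to {high2}: chance {chance_within(2.0):.2f}")
```

About 68 per cent of the curve lies within one standard error of the obtained
score, which gives odds of roughly 2 to 1; widening the band to two standard
errors takes in about 95 per cent.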

In education, concomitant variation between two sets of measurements is
an important consideration. The correlation coefficient is a mathematical
expression for the extent to which variation in one set of measurements of a
variable is accompanied by variation in a set of measurements of another
variable. The coefficient is symbolized by r and has magnitude from +1
through zero to -1. As r approaches +1, the relationship between the two
sets of variables is direct or positive and extensive. As it approaches -1, the
relationship is inverse or negative but equally extensive. The coefficient itself
indicates association only. Any cause-and-effect relationship is a matter of
inference.

Coefficients of correlation are used as indexes of the estimated reliability
of tests and other measuring instruments. Reliability means the extent to which
a measuring procedure is likely to yield the same measure when it is reapplied
to the same dimension of the same phenomenon. The reliability of tests may
be assessed directly in some cases by actually reapplying the test to the same
group and then determining the extent of correlation between scores on the
first administration and on the second. Generally, though, it is estimated
indirectly by comparing the scores from comparable forms of the test or the
scores from two equivalent halves of the test.

Validity of a measuring procedure (how well it measures what it purports
to measure) may be expressed by a coefficient of correlation when there is a
companion procedure accepted as a criterion or standard and with whose
results the results of the procedure in question can be compared. Usually,
though, there is no accepted criterion and hence statements about validity
must be based on a logical analysis of the procedure and the measures it obtains
for persons having certain known characteristics.


EXERCISES

1. Why is the term “normal curve” a misleading term?

2. A student obtained a score of 84 on a test for which the standard error is
[illegible]. What is the 68 per cent confidence interval? the 95 per cent
confidence interval?

3. Give examples of some possible uses of the normal probability curve to the
classroom teacher.

4. What assumptions must be made by a classroom teacher who “grades
according to the curve”?

5. Suppose that the correlation between vocabulary and reading
comprehension is .80. What proportion of the variance in reading comprehension is due
to variation in vocabulary?

6. The scores on a spelling test are normally distributed with a mean of 65
and standard deviation of 15. What is the probability that a student’s score will
be greater than 80?

7. Examine the test manuals of three different standardized tests in regard to
what is said about the reliability and validity of the tests. Make a critical analysis
of each in the light of what you have learned in this chapter.




CHAPTER 9 


EVALUATIVE STANDARDS— MARKING AND 
REPORTING ACHIEVEMENT 


In the introduction to this first section of the book we defined
measurement and evaluation as being a twofold action. In the first phase, the status
of a phenomenon was appraised in some manner and in the second, the value
of this status was determined by comparing it with an appropriate standard.
So far we have dealt exclusively with the first phase, measurement itself. We
have examined the varied procedures by which appropriate measurement
symbols may be assigned to measurable dimensions of educational phenomena.
For the most part these symbols have been numerals denoting classification
or rank and the procedures have been observation, product analysis and
several types of testing. Now we wish to discuss the second phase, evaluation,
and its application to schooling. In education, and more particularly in the
teaching situation, measurement symbols seldom are used except as a basis for
making qualitative judgments about the achievement of pupils. Johnny’s test
score usually is not left to stand by itself but is the basis for a declaration that
Johnny’s learning is excellent, good, fair, poor, or unsatisfactory, and the
making of such statements seems to be one of the major functions of a teacher.

Distinction Between Evaluation and Measurement 

Evaluation is a process of making qualitative determinations. As such it is
akin to what we have called measurement if not just a special form of
measurement. Just as we assign symbols to phenomena to describe their status, so
do we assign symbols to phenomena to indicate their quality or desirability.
And as we often use a scale of inches, pounds, or seconds as the basis for
measuring status, so do we often use a scale of quality variation as the basis
for evaluating the phenomena. We call these quality scales evaluative
standards. To show the distinction between measurement and evaluation as we
are using the terms, consider the following illustration. Imagine that we wish
to buy some lengths of lumber for a house we are building. The evaluative
standard to be applied to the lumber is the blueprint for the house. We go to
a lumberyard and measure with a tape measure the linear dimensions of
several pieces. When we compare these measurements with the lengths called
for by our evaluative standard (the blueprint), we find that certain pieces
would fit but that others would not, and so we qualitatively rate the pieces of
lumber as being desirable or undesirable for our purpose. In this example, the
measurement function and evaluation function are clear-cut. It was the
function of measurement to provide the symbols representing the status of the
pieces of lumber at that time. It was the function of evaluation to use the
evaluative standards in making a qualitative judgment.

From the example it may seem that the distinction between evaluation and
measurement will always be clearly discerned. On the contrary, there are
times when the difference between measurement and evaluation is subtle and
difficult to identify. This occurs when evaluation becomes automatic without
any conscious thought after the measurement has been made. A common
illustration of this is when the relative size of a test score automatically connotes
relative standing with regard to desirability; where it already is understood that
the top score represents top quality and excellence, and so on down to where
low test scores automatically represent unsatisfactory standing. Another
example of the unclear distinction between measurement and evaluation is the
case where custom and long usage have fixed an association between certain
measurement symbols and particular standards of quality. For instance,
certain IQ brackets through long association have come to represent certain
“values” of intelligence: 20-50, “imbecile”; 50-70, “moron”; 140 and up,
“genius.” Because of the prevalence of this automatic translation of
measurement symbols into evaluative symbols it must be emphasized that
measurement symbols are not in themselves standards and it is only by custom that
they may directly symbolize value.

Chapter 4 presents still further examples of confusion between measurement
and evaluation, in which observations that evaluate and observations that
measure or appraise status are often confused. We are
more prone to observe that Jimmy is a natural-born trouble-maker than to
observe objectively what Jimmy actually does. This points to the serious
consequences of confusing measurement and evaluation. Measurement carries
with it a certain air of objectivity and finality. Thus when a person thinks he
has measured but actually has evaluated, his evaluation unfortunately takes
on this air of finality and objectivity. Jimmy has been “measured” and found
to be a natural-born trouble-maker. His status has been appraised and there
is nothing more that can be done about it.

More examples of the hazy distinction between measurement and
evaluation could be provided. However, the too prevalent misunderstanding about
what is measurement and what is evaluation, and the possible consequences of
this misunderstanding in the form of arbitrary action and injustice, should be
sufficient to indicate the need for a full discussion of the evaluation process
apart from the measurement process. We shall begin with an investigation of
the nature and source of evaluative standards and symbols and then we shall
discuss the evaluation process as a whole. After this we shall study briefly the
matter of marks and reporting systems. Our treatment shall for the most part
be restricted to evaluating achievement in school subjects.

Nature and Source of Evaluative Standards 

In a sense, an evaluative standard is anything that is used as a basis for 
judging value or desirability. In our lumber illustration the blueprint was an 
evaluative standard. Sometimes a purpose is a standard; if pupils are trying
to paint a mural, actions may be evaluated in terms of their contribution to 
the completion of the mural. Frequently, evaluative standards are purely 
arbitrary conceptions, e.g., 70 per cent is passing, 95 per cent is excellent,
above average is good, etc. In such matters as dress, talk, manners, and letter 
form, the standard for evaluating merit usually is custom — what the greatest 
number of a given group do. As we shall see later, standards differ as to their
validity and appropriateness. 

One type of standard has particular significance for school and it will be
discussed at length in a later section. This is a scale or hierarchy of
performance. It may consist of gradations of handwriting deemed to cover the
range of possible quality in school children. It may be the typing speeds that
must be attained if certain marks are to be received. Or it may be gradations
in understanding of a subject that have been defined and set down. 

All too frequently we find in schools that evaluation standards are based
only upon emotions and rigid preconceived ideas. How I, the teacher, just
happen to feel when I think of Johnny is one of these and, unfortunately, too
often this one is the primary basis for the mark Johnny receives. Another is
the body of prejudice stereotypes about good and bad pupils, national and
racial “types,” and the “athlete,” the “student,” and the “clown.” Evaluations
based upon such irrational and emotional standards are more properly termed
prejudices or opinions. Needless to say, the use of irrational standards tends
to vitiate effective evaluation.

The ultimate source of all evaluative standards is, of course, the value 
complex of our American culture. Their immediate source is the writings of
experts in philosophy, psychology, sociology, history, and in other social
sciences who have studied this complex. By observing what people say is good
or bad, by observing the choices they make, where they spend their time and
money, by examining legal documents, religious documents, and other
expressions of ethics, and by looking at customs and traditional practices,
these experts are able to express a consensus of what culture believes to be
worthwhile.

Many of the standards used in our schools are derived directly from the
general values of our culture. Thus, punctuality is an American value, so 
standards of pupil citizenship place a premium on being on time and a stigma 
on being tardy. Knowledge of our country’s past being prized by American 
citizens, standards of knowledge for school pupils at all levels are heavily 
weighted in the direction of the facts of United States history. Nowhere is the
relationship between culture values and school standards more apparent than
in the case of social adjustment. As a group, Americans esteem extroversion
and varied pursuits. So in the schools, many guidance officials are concerned 
about taciturn children and those who do only one thing, and tend to label
the leaders and the active as well adjusted. 

Other standards in school use are particular just to schools themselves. 
They derive principally from the practice of countless teachers, the nature of 
subjects taught, and the developmental characteristic of pupils. An example 
of the first is the standard that “70 per cent is passing.” The second source 
is illustrated by the premium placed on accuracy in arithmetic. The influence 
of child development is seen in the differential standards held at successive 
grades for such subjects as reading, writing, art, and music. 

CRITERIA OF VALID STANDARDS 

The selection or development of evaluative standards is a crucial
undertaking. Not only do the evaluative standards reflect the objectives of a
school program but also, as we shall see later, they have an important effect
upon the construction of measuring devices. As a guide, therefore, to
identifying and devising valid evaluative standards, the following criteria are
presented:

1. The evaluative standards should consist of a variation scheme of quality,
desirability, or value. There should be no question about the fact that the
standards are concerned with quality or value only. This means, then, that
qualitative terms should be explicitly expressed and not vaguely implied. Also,
symbols that tend to confuse variations in quality with variations in status
should be avoided.

2. The degrees of variation of quality should be clear and well defined.
There should be distinct boundaries between the different categories of value
and the meaning of each category should be explicit.

3. The evaluative standards should be reasonably stable and objective.
That is to say, the standards should not be subject to sudden arbitrary changes,
nor should the standards be expressed in vague terms that will permit
capricious interpretation. Pupils should feel that the standards applied to them
will provide consistent and fair evaluations of their performance.

4. The evaluative standards should be expressed in terms that are
appropriate and favorable to the best procedures of measurement. Too often we
encounter standards stated in such a way that it is well-nigh impossible to set
up a measuring procedure to provide a satisfactory basis for making an
evaluation. Suppose you were asked to make a qualitative judgment on someone's
“ability to think effectively” or on someone's “personal satisfaction in
achievement.” You might agree that these terms do not facilitate measurement.
Yet such general phrases often are the elements of evaluation schemes. We shall
find that evaluative standards and measuring procedures are highly
interdependent in the evaluation process. Properly stated standards generally
provide for the best measuring procedures and likewise sound measuring
procedures provide the best basis for developing adequate evaluative standards.



5. The evaluative standards should be consistent with scientific findings
about learning and child development. The standards should be neither too
high nor too low for the maturity of children in given grades. Nor should they
be inconsistent with the way we learn. An example of the latter would be the
placing of a high qualitative rating on a student saying or writing the right
words without understanding their meaning. Another example would be the
placing of a high qualitative rating on a correct numerical answer to an
arithmetical problem without determining if the process was understood by the
student. Other psychological aspects of evaluative standards will be discussed
more fully later.

6. Finally, the evaluative standards should be consistent with the values
of our culture. For one thing, they should reflect the emphasis in our culture
on human worth and dignity, and on democratic values in general. A low
qualitative rating, for example, should be placed on docile or completely
submissive behavior. For another, they should relate realistically to standards
of work and production held by business and industry. Without this, pupils are
apt to be unrealistic in their first attempts at vocational adjustment.

A CRITIQUE OF SOME EVALUATIVE STANDARDS IN CURRENT USE

Now that we have before us some criteria for judging the validity of
evaluative standards, let us examine three types of standards sometimes used in
the schools:

1. Standards based upon arbitrary portions or points of distributions.

2. Standards based upon arbitrary amounts of work completed or percentage
of questions answered correctly.

3. Standards based upon emotions and preconceived stereotypes.

Distribution Standards. An example of the first type of standard is provided
by a teacher who will assign only 10 per cent A's, 25 per cent B's, 40
per cent C's, and so on, to any distribution of test scores. Ordinarily these
percentages are applied to distributions of whatever groups are being dealt
with by the teacher at the time. Standards of this type fail to meet the first
criterion listed in the previous section. No variation in quality is explicitly
stated. Quality of performance has been confused with relative standing in
arbitrary groups. Since groups vary greatly in performance, this type of
standard fails to meet the criterion of stability. An excellent student may be
penalized unduly for being in a superior group. The same criticism applies to
evaluative standards based upon certain points in the distribution, such as the
average (mean or median), and also to standards that claim to be based upon
portions of the normal probability curve (“grading on the curve”).
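
The instability of a distribution standard can be seen in a small worked sketch. The Python below is an illustration only, not a procedure given in this text: the function name, the pupil names, and the scores are hypothetical, and only the shares stated in the teacher's example come from the text; the remaining shares are assumed so that the percentages total 100.

```python
# Quotas as whole percentages, best mark first. Shares not stated in the
# teacher's example are assumptions chosen so the percentages total 100.
QUOTAS = [("A", 10), ("B", 25), ("C", 40), ("D", 20), ("F", 5)]


def grade_by_quota(scores, quotas=QUOTAS):
    """Mark pupils by rank within their own group, whatever they actually did."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    marks, start, cumulative = {}, 0, 0
    for mark, percent in quotas:
        cumulative += percent
        end = cumulative * len(ranked) // 100  # integer cut-off position
        for name in ranked[start:end]:
            marks[name] = mark
        start = end
    return marks


# Pat earns the same score, 88, in two hypothetical groups of ten pupils.
ordinary_group = dict(zip(
    ["Pat", "p2", "p3", "p4", "p5", "p6", "p7", "p8", "p9", "p10"],
    [88, 80, 75, 72, 70, 66, 61, 55, 50, 42]))
superior_group = dict(zip(
    ["Pat", "q2", "q3", "q4", "q5", "q6", "q7", "q8", "q9", "q10"],
    [88, 99, 97, 96, 95, 94, 93, 92, 91, 90]))

print(grade_by_quota(ordinary_group)["Pat"])
print(grade_by_quota(superior_group)["Pat"])
```

The same performance receives the top mark in one group and the bottom mark in the other, which is the failure of stability described above.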

Percentage Standards. The second type of standard is used by a teacher
who assigns an A to students answering at least 95 per cent of the questions
correctly, a B for 85 per cent answered correctly, and so on, or by a teacher
who assigns grades on the basis of the number of projects or experiments
completed. Again, this type of standard may fail to meet the first criterion.
Variation in quality is only implied and may not be properly related to
variation in performance. Likewise, since tests do vary in over-all difficulty,
the stability of this type of standard is open to question. For example, a
superior student may be penalized by an unduly difficult test and a poor one
favored by an easy test.
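
A similar sketch shows how a percentage standard shifts with test difficulty. It is an illustration under stated assumptions: the A and B cut-offs follow the teacher's example above, while the C and D cut-offs, the function name, and the two test results are hypothetical.

```python
# (minimum per cent correct, mark), best mark first.
# The 95 and 85 cut-offs follow the example; 75 and 70 are assumed.
CUTOFFS = [(95, "A"), (85, "B"), (75, "C"), (70, "D")]


def grade_by_percentage(correct, total, cutoffs=CUTOFFS, failing="F"):
    """Return the first mark whose cut-off the pupil's percentage reaches."""
    percent = 100 * correct / total
    for minimum, mark in cutoffs:
        if percent >= minimum:
            return mark
    return failing


# The same pupil takes an easy test and an unduly difficult one.
easy_test_mark = grade_by_percentage(correct=19, total=20)
hard_test_mark = grade_by_percentage(correct=16, total=20)
print(easy_test_mark, hard_test_mark)
```

Nothing about the pupil has changed between the two tests; only the difficulty of the questions has, yet the mark moves by two categories.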

Emotional Standards. The third type of standard obviously fails to meet
the criteria because of its extreme instability and capriciousness. As soon as
emotions and feelings enter into standards, all objectivity goes out the window.
Teachers will always be subject to temporary feelings of fatigue, anger, and
apathy that will color their evaluations. Also, there are likely to be present 
unconscious stereotype reactions and unconscious refusals to reconsider 
values. 

While these “standards” do not constitute valid bases for evaluation, many
others used by teachers do. The nature and development of these are the
burden of this chapter.

Evaluative Symbols 

In the fourth criterion of essential attributes of valid standards, it is
implied that they may be expressed in various ways, some of which are
inappropriate for sound measurement. This raises a question as to what are the
symbols or forms used in expressing evaluations. This question sounds familiar
and, in fact, is like the one we asked about measurement in Chapter 1. We have
already indicated that evaluation is essentially a process of measuring quality.
Therefore it is reasonable to assume that the symbols of evaluation can be
classified in the same manner as the symbols of measurement, i.e., symbols
that indicate a scale position, symbols that indicate rank or order position,
and symbols that classify or describe.

In Chapter 1, it was stated that a unit of measure is required for symbols
that indicate scale position. Consequently, in order for evaluations to be
expressed in the form of scale symbols, a unit needs to be established. So far,
and excluding certain research situations, no unit of quality or desirability
has been established, “quality” being such an ambiguous consideration.
Therefore, we shall not expect to find evaluations expressed in the form of
scale symbols.

We should, though, expect to find and do find in education that evaluative
standards and qualitative judgments utilize rank symbols and classification
symbols. Ordinarily the letters A, B, C, D, F are used to indicate the quality
of classroom performance. In other places the words “excellent,” “good,”
“fair,” “poor” may be used, or possibly numbers such as 3, 2, 1, 0, -1 are
employed. All these symbols are classificatory since each one represents a
category of quality. Also we sometimes encounter percentile ranks and other
rank or order symbols, which imply standing in regard to quality.

In using any scheme of evaluative symbols, it is essential that they be
related to unambiguous and appropriate statements about the quality gradations
they represent. When the symbols stand alone, they have very limited
meaning. Suppose, for example, we were informed that a sixth-grade student
was given an A in spelling. The basis for the qualitative rating, the A, is not
apparent. We know only that he was given a top rating. It might be generally
assumed that the rating was made simply on the basis of his being able to spell
a large number of words correctly. However, we must wonder about the words
this student was able to spell correctly while others failed to do so. Were these
words phonetically obscure? How often have these words been previously
encountered? Moreover, we should wonder how much difference in spelling
ability exists between our pupil who received an A and another who was
given a B or a D.

The full role of standards in the evaluation process should be apparent by
now. The evaluative standards are the point of reference for qualitative
ratings; and, after these ratings have been made, the evaluative standards are the
basis for interpreting the evaluative symbols that have been assigned. As was 
indicated in the previous paragraph, if the evaluation symbols are presented 
without knowledge of the standards to which they refer, the meaning of these 
symbols is limited to an interpretation of relative standing only. 

Too often in education, though, just the results of evaluation, the assigned
symbols, are presented. The evaluative standards used in the process are at
best only vaguely implied in the test or the instrument used. This means that
the person whose performance is being rated must take on faith the fact that
the evaluative standards used are valid. The same faith is expected of the
parents and friends of the individual being rated, and is even required of other
educational institutions.

Steps in Valid Evaluation 

Now that we have examined the nature of standards and the use of evaluative
symbols, we wish to discuss in detail just how a teacher may go about
evaluating the achievement of pupils. In doing this we shall present first an
“example” and from this develop certain steps in valid evaluation and an
example of a defensible standard. Assume that Miss X's class has just completed
a topic and she now wishes to evaluate her students’ “comprehension of
the material covered.” Following custom, she begins to prepare a test. As she
makes up questions to ask on the test, her first concern is coverage. She wants
to be sure that the questions on the test cover all or at least most of the
important points discussed in class. Ordinarily, this consists of thumbing
through the text or some written material that was used and asking a question
or two from each page. Her next concern is to make certain that her test
includes some easy questions, which most of her class will answer correctly, and
also some more difficult questions which only a few can answer correctly. These
questions are all assembled into a test and administered to her class. After the
tests are scored, she assigns qualitative ratings, A, B, C, D, F on some
arbitrary basis, making sure that the number of A's, B's, C's, etc., assigned
corresponds roughly to some preconceived distribution. Thus the evaluation
process is ended.

Now that we have before us an example of what is often done in the name
of evaluation, let us analyze how evaluation may need to be performed if it is 
to be valid. As we proceed, we shall refer frequently to what was done or not 
done by Miss X. 

1. Determine what is to be evaluated. The first step in the evaluation
process seems to be obvious. It should consist of stating that which is to be
evaluated. What may be evaluated in this instance is no more or less than
some measurable dimensions of pupil achievement. In Chapter 2 we established
certain conditions of measurability and these again are applicable in
establishing the dimensions to be evaluated. The three most immediately
relevant conditions are that the dimensions have observable variations, be
clearly defined, and be conducive to agreement among observers. These
conditions necessitate that the learning or achievement or “pupil objectives” to
be evaluated must be stated in terms of behavior or performance. For example,
an objective, “knows a valid argument,” would be inferior to the objective,
“is able to identify an unstated assumption necessary for arriving at a certain
conclusion.” The latter, being a behavior, is more observable, better defined,
and in other ways comes closer to meeting the criteria set forth in Chapter 2.
In Miss X's case the focus of evaluation seems to be something rather vague,
not directly observable, and hence difficult to evaluate: “comprehension of
the material just completed.”

2. Establish a standard. After determining what we wish to evaluate, the
next step usually is to select or devise a standard appropriate to its evaluation.
While we have stated that several different types of standard are used for
evaluating pupil achievement, the one that seems to be most valid is a
performance scale or hierarchy. Moreover, if our focus of evaluation is to be
performance or behavior, a standard set in the same terms is indispensable. Our
standard should consist of descriptions of several levels of performance
relative to each learning objective or dimension of achievement. These should
range from the least “valuable” performance to the most “valuable” and should
be stated in clear behavioral terms. (An example is shown on page 202.) This
step of establishing a standard is only vaguely implied by Miss X in her
selection of easy and difficult questions and unfortunately it is rarely
specified as a part of the evaluative process in most descriptions of
educational evaluation.

Setting these evaluative standards is mostly a matter of thoughtful
experience on the part of a teacher. This, of course, places the new teacher at
a very serious disadvantage. In this case, a new teacher should use the
experience of successful teachers and should consult pupils’ texts, teacher
guides, courses of studies, and any other material that may provide information
about appropriate standards. It is important that the beginning teacher at least
make an attempt to set forth these standards, no matter how crude they may be.
These first attempts will start the teacher on the right path toward effective
and valid evaluation and will provide a basis for future revision. But at no
time should these standards be considered fixed or absolute.

3. Prepare the measuring devices. After the evaluative standards have
been set forth in the form of varying levels of performance, the third step in
the process consists of constructing or finding a procedure that provides a
measure of performance relative to each dimension of achievement cited in
the standard. The various types of measuring procedures and specifications for
their preparation have been thoroughly discussed in Chapters 3, 4, 5, and 6.
In the case of Miss X, this step of preparing the procedure is easily recognized
as the one where she formulates her questions and assembles them into a test.
In her case, though, this was the first step and we should see now that the test
or other measuring device ordinarily should be devised in accordance with an
already established evaluative standard. This means that, as the teacher
constructs her test or measuring device, she should have beside her a set of
evaluative standards in the form of levels of performance and her test should be
so designed that each level is adequately sampled. Such a measuring device will
then permit the teacher to observe the performance of each pupil at each of
the levels contained in her evaluative standards.

4. Measure and evaluate. Once the measuring instrument has been developed
or selected, the fourth and final step in the evaluation process is
simple: the instrument should be administered to the group of students
involved, scores related to the standard, and qualitative symbols assigned. Each
of the levels of performance in the evaluative standards should be represented
by a symbol or by a single descriptive word. Ordinarily when symbols are
used, they are ordered in such a way that the first symbol represents the top
level or gradation, the second symbol the next highest level, and so on down
the line until the lowest level of performance is reached. Each student is
assigned the evaluative symbol that represents the highest grade of performance
achieved by the student.

In Miss X's case it is apparent that her students have been evaluated
according to scores received on her test. Now since she has no well-defined set
of evaluative standards, she has to base her qualitative ratings upon some
arbitrarily established distribution of ratings or upon some arbitrary
percentage of questions answered correctly. An example of the first alternative
would be to decide that only 10 per cent of the group will be assigned the
highest qualitative rating; therefore she selects the necessary number of
students from those who stood highest on her test. An example of the second
alternative would be to decide arbitrarily that those students answering less
than 70 per cent of the questions correctly would be given an unsatisfactory
rating. In the case of either alternative it is apparent that the ratings would
have no clear-cut or consistent meaning in terms of performance.

TWO EXAMPLES OF EVALUATION

Now that we have discussed the four essential steps in the evaluation
process, we look at two specific examples of an attempt to carry out these
steps. The first is an attempt made by a second-grade teacher. Her class has
just spent a week studying a story and she is now ready to evaluate the
accomplishment of her pupils.

In line with the first step in the evaluative process, the teacher has stated
her objectives for this particular unit in terms of pupil behavior:

1. Ability to remember and understand the factual content of a story.

2. Ability to recognize and use the new words presented in the story.

The second step in the process asks that the teacher set forth her evaluative
standards. This she does by describing the following levels of performance:

Level                Performance

I. Unsatisfactory    Cannot pronounce correctly the new words when seen
                     written.
                     Cannot answer questions about the most obvious facts
                     contained in the story.

II. Fair             Can pronounce the new words correctly when seen
                     written.
                     Can answer questions about the most obvious facts
                     contained in the story when the questions are phrased the
                     same way as the statements in the story.

III. Good            Same as level II and in addition: Can underline the
                     new words from among other words when the new words
                     are spoken. Also has a rough idea of the meaning of the
                     new word.
                     Can answer more subtle questions about the content of
                     the story when the questions are phrased differently from
                     the sentences in the story. Also can recall the sequence of
                     the story.

IV. Excellent        Same as levels II and III and in addition: Can correctly
                     use new words in sentences of his own construction drawn
                     from his own experience; and can identify when the new
                     words are improperly used.
                     Can relate the content of the story to his own experience
                     and can give plausible explanations of the events in
                     the story.

Since the teacher is concerned with two objectives (or dimensions), she
therefore needed to indicate varying levels of performance regarding the
achievement for each objective. The way in which the performance levels are
described above would indicate that the teacher has had already some experience
with second-grade children and has done some serious thinking about
what constitutes varying degrees of desirable behavior on the part of her
children. On the whole, the performance scale she used meets the criteria for
evaluative standards outlined earlier in this chapter.

With the evaluative standards before her as outlined above, the teacher
is now ready to take the third step in the evaluative process. This consists of
developing a measuring device which will measure pupil performance at each
of the four levels and relative to both of the dimensions in the evaluative
standards. We shall provide here only a few examples of questions eliciting
performance at each of the levels. At the second-grade level, the teacher will
have to read the questions orally and the children will either answer orally
or on a mimeographed sheet provided for them.

For Levels I and II:

(Teacher) “Read each of these words out loud.”

1. hop       NOTE: These words appear on the children's sheet and individually
2. shoe      they are to pronounce each word as they see it.
3. walked
4. fun

Oral questions and answers:

1. What did Bill and Susan do when they played?

2. Bill said we must play in what kind of shoes?

For Level III:

(Teacher) “Draw a line under the right word in each box as I say it: 1. hop,
2. shoe, 3. walked, 4. our.”

1.  have      [...]     how     hot
2.  show      store     shot    saw
3.  walked    wanted    went    walk
4.  one       out       own     our

(Teacher) “Draw a line under the right words as I read each sentence.”

1. Bill and Susan can [...].

2. Bill had a [...] in his shoe.

3. The children [...].

4. Bill said, “We must play in [...] old shoes.”

Oral questions and answers:

1. What kind of store did Bill and Susan and Mother go to?

2. What did Bill do after he found the hole in his shoe?

3. Tell what happened after they left the store.

4. Can you show how Susan hops?

For Level IV: 

(Teacher) “Make up a sentence about what you did or saw, using these words.” 

1. hole 

2. walked 

3. fun 

4. looked 

“Draw a picture about these words.” 

Oral questions and answers: 

1. Why did Bill say, “We must play in our old shoes”? 

2. What caused the hole in Bill’s shoe? 

3. Why didn’t Bill’s Mother scold him for the hole in his shoe?

4. Can you tell a story about when you thought you were going to be scolded and 
you weren't? 

These are a few of the questions that the teacher in our example developed
to measure performance at each of the levels set forth in the evaluative
standards. Those designed for Levels I and II are keyed to minimal performance
and are used to distinguish between unsatisfactory achievement and that
which is barely satisfactory. Level III questions represent the next higher level
of performance, and finally the last group of questions clearly is keyed to the
top level of performance. It should be apparent that the construction of a
measuring device becomes much simpler when the levels of performance in
the evaluative standards have been carefully delineated. The teacher, of
course, should construct a sufficient number of questions so that each dimension
is well sampled at each level. (See pages 102-104 for a detailed discussion
of sampling in test construction.)
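
The requirement that each dimension be sampled at each level can be checked mechanically once every question is tagged. The sketch below is one hypothetical way to do so; the tag names, the draft tallies (which only roughly mirror the sample questions above), and the minimum number of questions per cell are all assumptions, not figures given in this text.

```python
from collections import Counter

# Question groups and dimensions as used in the second-grade example.
LEVELS = ["I-II", "III", "IV"]
DIMENSIONS = ["story content", "new words"]


def coverage_gaps(items, minimum):
    """List every (dimension, level) cell with fewer than `minimum` questions.

    items: one (dimension, level) tag per question on the draft test.
    """
    counts = Counter(items)
    return [(dimension, level)
            for dimension in DIMENSIONS
            for level in LEVELS
            if counts[(dimension, level)] < minimum]


# A hypothetical draft test, tagged question by question.
draft = ([("new words", "I-II")] * 4 + [("story content", "I-II")] * 2 +
         [("new words", "III")] * 8 + [("story content", "III")] * 4 +
         [("new words", "IV")] * 4 + [("story content", "IV")] * 4)

print(coverage_gaps(draft, minimum=3))
```

With an assumed minimum of three questions per cell, the draft is thin only on story content at Levels I and II, so that is where the teacher would add questions.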

The final step in the evaluative process consists of administering the test 
and assigning the qualitative symbol that represents each pupil's highest level 
of performance. Please notice that these symbols now have very specific mean- 
ings in terms of performance. 
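
This final step can be sketched as a short routine. It is a minimal illustration under assumptions that go beyond the text: a level is counted as "achieved" when a pupil answers an assumed share of that level's questions acceptably, and, because each level is described as "same as" the lower levels "and in addition," a higher level is credited only when every lower level has also been achieved.

```python
# Levels II to IV of the performance scale, lowest first.
LEVELS = ["Fair", "Good", "Excellent"]
LOWEST = "Unsatisfactory"  # level I: not even minimal performance shown


def highest_level(results, levels=LEVELS, required=0.75):
    """Return the highest level a pupil has achieved.

    results: maps level name -> (questions answered acceptably, questions asked).
    required: assumed share of a level's questions needed to credit that level.
    """
    achieved = LOWEST
    for level in levels:
        passed, asked = results[level]
        if asked == 0 or passed / asked < required:
            break  # levels are cumulative, so stop at the first one missed
        achieved = level
    return achieved


# Two hypothetical pupils.
pupil_a = {"Fair": (4, 4), "Good": (3, 4), "Excellent": (1, 4)}
pupil_b = {"Fair": (1, 4), "Good": (4, 4), "Excellent": (4, 4)}

print(highest_level(pupil_a))
print(highest_level(pupil_b))
```

The cumulative rule is a design choice: without it, a pupil who guessed well on the hardest questions but could not handle the minimal ones would receive a symbol whose stated meaning does not describe his performance.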

Further to demonstrate effective evaluation, we wish now to present an
example from the upper elementary grades. In this one, the teacher wishes
to evaluate his pupils’ understanding of the following written passage.

The Great Law *

Not long after his arrival in the New World, William Penn called a meeting
at Upland of all the Swedish, Dutch, and English settlers. After making a short
friendly speech, the Quaker leader told the people about the constitution, or set of
laws, under which they were to be governed.

Penn’s constitution was known as The Great Law. It provided that the colonists
should be free to worship in any way they saw fit; that all settlers who paid taxes
had the right to vote; that every male member of any Christian Church might hold
office; that all children should begin training for some trade or useful occupation at
the age of twelve; that only two crimes, murder and treason, should be punishable
by death; and that all prisoners should be put to work at some useful trade so that
they might become good citizens. These last two provisions of Penn's Great Law
are especially interesting in view of the fact that in England at this time there were
over two hundred crimes punishable by death, and that no one had ever thought
before of making a prison anything but a place in which to lock up people who
had done wrong.

* G. V. D. and I. V. D. Southworth, Early Days in the New World. Syracuse, N. Y.:
Iroquois Pub. Co., 1950 (pp. 242-243).

As a basis for judging the pupils’ understanding of “The Great Law,” the
teacher prepares the following performance hierarchy:

Level                Performance

I. Unsatisfactory    Cannot answer questions about the most obvious facts
                     contained in the story.

II. Fair             Can answer questions about obvious facts in the reading
                     material.

III. Good            Same as level II and in addition: Can make comparisons
                     and discriminations, and can formulate in own
                     words definitions and illustrations of concepts presented in
                     the reading material.

IV. Excellent        Same as levels II and III and in addition: Can give
                     explanations and interpretations and can justify and
                     predict on the basis of what is contained in the reading
                     material.

With this variation scheme as her standard the teacher may then develop
an instrument for measuring performance at each of these levels. Some possible
questions that refer to “The Great Law” are presented as follows.

For Levels I and II (Simple recall):

1. Who attended the meeting at Upland?

2. What crimes were punishable by death under the Great Law?

3. At what age were children required to start their occupational training?

For Level III (Illustrate, define, compare, discriminate):

1. How would you define treason?

2. Give an example of a person who would not be able to vote under this Great
Law.

3. What is meant by the phrase “hold office”? Give an example.

4. What does “a trade or useful occupation” mean? Give an illustration of
something that is and something that is not a trade or useful occupation.



For Level IV (Explain, justify, predict, interpret):

1. Why do you suppose it is required that a person be a taxpayer before he has
the right to vote?

2. Suppose you were able to visit a prison in England and a prison in William
Penn’s colony at that time. What essential difference would you expect to find
between these prisons?

3. Mr. Williams, a colonist, attempted to run for office but was disqualified. For
what reason do you suppose he was disqualified?

4. Would you say that reducing the number of crimes punishable by death is an
improvement? Give your reason.

These are some of the many questions that may be used to elicit responses
indicative of each of the levels in the evaluative standard. The questions are
written in short answer form but they can be changed to true-false,
multiple-choice, or to some other guided response form if desired. Ordinarily a
much greater amount of reading material is covered by a classroom test. This
short passage was selected, however, to illustrate on a small scale how the
evaluative process can be carried out effectively.
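
The link between an instrument and its evaluative standard can be sketched in
a few lines of Python. This is only an illustration under stated assumptions:
the item names are invented labels for the questions above, Levels I and II
are grouped together as in the question list, and the rule that a pupil
"reaches" a level when he answers at least half of its items acceptably is a
hypothetical cutoff, not one proposed in the text.

```python
from collections import defaultdict

# Hypothetical labels for the sample questions, each tagged with the level
# it is meant to elicit. Levels I and II (simple recall) share the tag 2.
ITEM_LEVELS = {
    "who_attended_upland": 2,
    "crimes_punishable_by_death": 2,
    "age_occupational_training": 2,
    "define_treason": 3,
    "example_nonvoter": 3,
    "meaning_hold_office": 3,
    "meaning_useful_occupation": 3,
    "why_taxpayer_to_vote": 4,
    "compare_prisons": 4,
    "why_disqualified": 4,
    "fewer_capital_crimes": 4,
}


def highest_level_reached(correct_items, item_levels=ITEM_LEVELS, pass_fraction=0.5):
    """Return the highest level reached, or 0 if the lowest level is not reached.

    A level counts as reached only if the pupil answered at least
    `pass_fraction` of its items acceptably AND reached every lower level,
    which mirrors the cumulative wording of the standard.
    """
    total = defaultdict(int)
    right = defaultdict(int)
    for item, level in item_levels.items():
        total[level] += 1
        if item in correct_items:
            right[level] += 1

    reached = 0
    for level in sorted(total):
        if right[level] / total[level] < pass_fraction:
            break  # a gap at this level stops the climb
        reached = level
    return reached


# A pupil who recalls all the facts, handles half the Level III items,
# and only one of the four Level IV items.
pupil = {
    "who_attended_upland",
    "crimes_punishable_by_death",
    "age_occupational_training",
    "define_treason",
    "example_nonvoter",
    "why_disqualified",
}
print(highest_level_reached(pupil))
```

The design choice worth noting is the early `break`: a pupil who answers only
the Level IV items is not credited with Level IV, since each level in the
standard presupposes the ones below it.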

We can see in the two examples the importance of carefully prepared
evaluative standards. They serve as guides for developing an effective
measuring instrument. Furthermore, they give specific meanings to the
qualitative ratings assigned. When a student receives a mark of “good” or a B,
both the student and the teacher have a clear understanding of what the symbol
means in terms of his performance.

Levels of Performance as Standards in Evaluating Achievement 

Whether the standards are arbitrary distributions, teacher feelings, or
given percentages of questions answered, levels of performance are, of course,
involved in some way. Differences in rank in a distribution of test scores
necessarily are somewhat a function of differences in performance; the
feelings of teachers about pupils certainly are based on the performance of
pupils; and, in a sense, to answer a greater percentage of test questions is to
perform at a different level. In this text, though, we are proposing that the
levels of performance should always be explicit and that evaluative standards
for achievement usually should consist of performance scales or hierarchies.

To help the beginner, we shall present a generalized scheme of variation
in performance which is thought to be applicable to the understanding of any
verbal subject. Since most subjects studied in school are verbal or have verbal
aspects, the scheme should have widespread significance.

The use of defined levels of understanding as evaluative standards has not
been extensive; therefore this particular variation scheme should be
considered as a beginning and not as a finished product. The levels are meant to
be suggestive only. Depending upon the particular subject, grade, etc., the
levels may overlap, they may omit important items, and some of them may be
inappropriate.





Performances Indicating Different Levels of Understanding
of a Given Subject

Level   Performance

I       Imitating, duplicating, repeating.

        This is the level of initial contact. Student can repeat or duplicate
        what has just been said, done, or read. Indicates that student is at
        least conscious or aware of contact with a particular concept or
        process.

II      Level I, plus recognizing, identifying, remembering, recalling,
        classifying.

        To perform on this level the student must be able to recognize or
        identify the concept or process when encountered later, or to remember
        or recall the essential features of the concept or process.

III     Levels I and II, plus comparing, relating, discriminating,
        reformulating, illustrating.

        Here the student can compare and relate this concept or process
        with other concepts or processes and make discriminations. He can
        formulate in his own words a definition, and he can illustrate or give
        examples.

IV      Levels I, II, and III, plus explaining, justifying, predicting,
        estimating, interpreting, making critical judgments, drawing
        inferences.

        On the basis of his understanding of a concept or process, he can
        make explanations, give reasons, make predictions, interpret, estimate,
        or make critical judgments. This performance represents a high level
        of understanding.

V       Levels I, II, III, and IV, plus creating, discovering, reorganizing,
        formulating new hypotheses, new questions and problems.

        This is the level of original and productive thinking. The student’s
        understanding has developed to such a point that he can make
        discoveries that are new to him and can restructure and reorganize his
        knowledge on the basis of his new discoveries and new insights.
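
For readers who keep records by machine, the cumulative character of the scale
("Level I, plus ...") can be made concrete with a small Python sketch. The
short level names (`IMITATING`, `RECALLING`, and so on) are our own shorthand
and do not appear in the scale; the performances listed under each level are
taken from the table above.

```python
from enum import IntEnum


class Level(IntEnum):
    """The five levels of the scale; the names are illustrative shorthand."""
    IMITATING = 1
    RECALLING = 2
    RELATING = 3
    EXPLAINING = 4
    CREATING = 5


# Performances that first appear at each level of the scale.
NEW_PERFORMANCES = {
    Level.IMITATING: {"imitating", "duplicating", "repeating"},
    Level.RECALLING: {"recognizing", "identifying", "remembering",
                      "recalling", "classifying"},
    Level.RELATING: {"comparing", "relating", "discriminating",
                     "reformulating", "illustrating"},
    Level.EXPLAINING: {"explaining", "justifying", "predicting", "estimating",
                       "interpreting", "making critical judgments",
                       "drawing inferences"},
    Level.CREATING: {"creating", "discovering", "reorganizing",
                     "formulating new hypotheses",
                     "formulating new questions and problems"},
}


def performances_at(level):
    """Everything expected at `level`: its own performances plus all lower ones."""
    expected = set()
    for lower in Level:
        if lower <= level:
            expected |= NEW_PERFORMANCES[lower]
    return expected


def level_of(performance):
    """The lowest level at which a performance first appears in the scale."""
    for level in Level:
        if performance in NEW_PERFORMANCES[level]:
            return level
    raise KeyError(performance)


print(level_of("predicting").name)
print(sorted(performances_at(Level.RECALLING)))
```

Using an ordered enumeration keeps the hierarchy explicit: a comparison such
as `Level.RELATING < Level.EXPLAINING` holds, so ratings can be sorted or
compared without a separate lookup table.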

In using such a performance scale there is the assumption that the teacher
knows pretty well what the students are reading and otherwise experiencing.
If a teacher is not too well acquainted with what a student has read or done,
then she may be misled into thinking that a student is performing at Level IV
in being able to explain or predict while actually the student is operating at
Level II, simply remembering what he had previously read or done.
Furthermore, the student may be performing at Level V with some original
thinking whereas the teacher might believe that the student is simply imitating
or duplicating at Level I.

The fifth level of performance, original and productive activity, is the
ultimate goal of all education. At this stage the student is capable of
independent work. Since most creative activity is spontaneous, the teacher
cannot so easily solicit response at this level as she can at the other levels.
It is also interesting to note that a teacher can ordinarily expect to lead a
student through the first four levels of understanding but for the fifth level
the student must be his own guide.

As a final remark on the evaluative process, we must say that much more
study is needed in developing general variation schemes for quality of
performance that can serve as a basis for teachers to develop their own
evaluative standards for their own particular situations. Substantial progress
has been made toward setting up carefully defined objectives for various school
activities. However, there has been too little done toward establishing
differential levels of performance in the attainment of these objectives.²

The Function of School Marks 

At first glance, it seems that if the evaluation process has been carried out
properly, there should not be much of a problem in reporting the results. On
the contrary, we shall find in this section that the reporting of school marks is
a very vexing problem even with effective evaluation. The latter, though, is
prerequisite to any possibility of success in reporting pupil progress, as
Wrinkle has testified:

Finally, almost ten years later, we discovered that we couldn't report
intelligently unless we first evaluated intelligently, and that we couldn't
evaluate intelligently unless we knew what we were trying to do (15:3).

Before we get involved in the details of reporting practice, let us consider
the several functions that marks are thought to serve. We might do this by
asking who are interested in marks and why.

Certainly the teachers are interested in them, although their job might
present less frustration if report cards were discarded. For the most part, they
use them in valid ways: to tell their pupils what they think of their work and
to determine what has been their achievement in other subjects and in earlier
grades. In addition some teachers like to “motivate” pupils by promising A's
for good work and F's for slothfulness. The few who are vindictive can punish
pupils who have bothered them by giving such pupils low grades. If they are
guidance-minded, they may attempt to use marks to promote pupil adjustment.

The pupils, themselves, are of course interested in the grades they receive.
They wish to know how well they are progressing and in what areas and
activities they need further development. They need the information provided by
marks so that they can make realistic educational and vocational plans for the
future. And, of course, in secondary schools there are honor rolls, athletic
eligibility, and many other things that hinge on school marks.

² The Taxonomy of Educational Objectives (2), cited in Chapter 2 for its analysis
of dimensions of knowledge, includes as well an attempt to establish levels of
performance relative to these dimensions. This book is highly recommended to all
teachers who wish to set valid evaluative standards in their teaching. It might
be well to compare the hierarchy of performance levels proposed by the college
examiners with that proposed by the authors here.

The parents of the pupils are extremely concerned about the marks given
their children. Among the reasons for their concern are genuine interest in
their children's progress, the prestige value of high marks, the “loss of face”
attendant upon F's, and their need, like that of their children, to have some
basis for educational and vocational planning.

All superintendents and principals must be interested in marks for
administrative reasons. They form an important part of the cumulative record for
each pupil. Grades are used as a basis for promotion or nonpromotion, and
for the determination of honor rolls. High schools are interested in seeing if a
student’s elementary marks justify his enrollment in given subjects. Colleges
often use high school marks as a basis or a partial basis for admission. Schools
also must be ready to provide information on the academic standing of a
student to a potential employer when requested. Consequently, in view of all
these administrative uses of grades, marks are well-nigh indispensable to
principals and superintendents.

A final group who may wish to know a pupil's school grades is composed
of prospective employers. Over-all high marks suggest a capable worker and
over-all low ones, a poor worker. Moreover, certain types of employers are
interested in grades in given subjects: the insurance firm in typing and
bookkeeping, the sales firm in public speaking, the chemical plant in science
and mathematics.

So far we have identified five groups of persons who are concerned about
grades, namely: the teachers, the pupils, the parents, the school administrators,
and the potential employers. The reasons for their interest are in effect the
functions of school marks. In summary, these seem to be:

1. Indicate academic standing and competence.

2. Facilitate instruction and guidance.

3. Provide motivation for learning.

4. Serve as a basis for future planning.

5. Serve administratively for placement, promotion, certification,
admission, and for permanent records.

6. Serve as predictors of school and vocational success.

From this brief resume of who are interested in grades and for what
reason, we can readily see why marking has become so complicated and confused.
There are many different persons and many different functions to be served
by the same marks. Yet the type of mark that best serves one group or
function may least serve another.





Marking and Reporting Practices ³

In day-by-day and week-by-week contact between pupils and teacher, the
evaluation process outlined earlier in this chapter should serve effectively. The
specific objectives for the classroom activities would be carefully defined, an
evaluative standard consisting of levels of performance for each objective
would be prepared, a device for measuring performance at each level would
be developed and administered, and evaluative ratings would be assigned
accordingly. For each day and each week, the pupils would know what
progress has been made and what has been accomplished. The pupils could
readily tell at what level they were performing and what needs to be done in
order to perform at a higher level. The teacher could know how effective the
instructional activity has been and what areas need further emphasis. The
qualitative ratings assigned by the teacher would have clear-cut meaning and
there would be a continual two-way communication between teacher and
pupils. In this case the functions of evaluation would be effectively served.

Trouble comes when the results of classroom evaluation are to be reported
to parents, school authorities, and to the public, particularly in the form of a
single mark at the end of a course. These persons, of course, have not seen
day-by-day and week-by-week evaluations made in the classroom. They tend
to be unaware of the complex nature of achievement in any subject, of the
tentative nature of each teacher judgment, and of both the teacher's objectives
and his standards. All they see is a semester grade and whatever meaning they
give it has an indeterminate error.

SINGLE LETTER MARKING

The practice of using a single letter, usually A, B, C, D, or F, as a
qualitative rating for a student's performance in a subject or activity has been
seriously questioned by many. Their criticisms seem to center around the
following points.

1. Each subject and each activity in school consists of many diverse
aspects and each requires a variety of different skills and abilities. It hardly
seems reasonable that a single mark could truly represent quality of
performance in all these aspects. At best it must represent an average of
qualitative ratings for all of them.

2. A single mark actually communicates very little, allows for no
explanation, and provides only a limited basis for action. These marks scarcely
indicate the pupil's strong points and weak points, and certainly they give
hardly any indication of the pupil’s potentiality. Without this information there
is little on which to base any positive action.

3. Letter marks are often construed as rewards and punishments for the
pupils and as prestige symbols by their parents and the public. When thus
construed, grades have become ends in themselves, something to be achieved
for their own sake instead of serving to facilitate learning. We should not have
to look far to find children who are going through “motions” in the classroom
just to achieve high marks and who are not concerned about learning anything.
This is extrinsic motivation at its worst.

³ Samples of several report cards are shown in Appendix C.

4. There is evidence that grades may have inconsistent meanings. Studies
have shown that different teachers will rate differently identical performances
and that the same teacher will rate the same performance differently on
different occasions. Since no common meaning has been established for these
marks, this certainly makes it difficult for parents and other outsiders to
determine just what the marks do mean. The prevalent unreliability of marks also
results in their limited use for predicting future academic success, for
placement in courses, and in general as a basis for future planning.
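
The first criticism, that a single mark is at best an average of ratings on
several aspects, can be illustrated with a short Python sketch. The aspects,
the four-point conversion of letters, and the round-to-nearest rule are all
assumptions made for the illustration; the text prescribes none of them.

```python
from statistics import mean

# Assumed conversion of letters to points (a common convention, not the text's).
GRADE_POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}
POINT_GRADES = {points: letter for letter, points in GRADE_POINTS.items()}


def single_mark(aspect_ratings):
    """Collapse ratings on several aspects into one letter via their mean."""
    average = mean(GRADE_POINTS[rating] for rating in aspect_ratings.values())
    # Round half up; int() truncation is safe because points are never negative.
    return POINT_GRADES[int(average + 0.5)]


# Two hypothetical pupils with quite different strengths and weaknesses.
pupil_1 = {"recall of facts": "A", "interpreting maps": "B", "analyzing problems": "C"}
pupil_2 = {"recall of facts": "C", "interpreting maps": "B", "analyzing problems": "A"}

print(single_mark(pupil_1), single_mark(pupil_2))
```

Both pupils receive the same letter although their profiles are mirror images
of each other, which is exactly the information a parent or employer cannot
recover from the single mark.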

Attempts have been made to de-emphasize the role of marks and thus to
remove the possibility of their becoming ends in themselves. This has
generally been done by reducing the number of letters used in marking and by
changing their meaning. For example, the A, B, C, D, E, F letters are replaced
by the letters S and U, S for satisfactory and U for unsatisfactory. This has
certainly de-emphasized the role of marks and has relieved the pressure of
competition on the pupils. On the other hand, the problem of communicating
the accomplishments of the pupils has only been aggravated. The use of only
two letters provides less information regarding the accomplishment and
progress of pupils. In general, schools that have tried the experiment of reducing
the number of letters used as marks have reported that the results were
unsatisfactory, particularly in regard to their responsibility for providing
information to parents and others about the competence of their pupils.

ANALYTIC EVALUATION

Schools have also sought answers to the objection that single letter marks
actually convey little information and provide little basis for action or proper
interpretation. One approach to this problem has been to list specific
dimensions or aspects of a subject or activity and then to provide qualitative
ratings for each. For example, in social studies achievement the following
dimensions could be listed:

1. Developing a special vocabulary. 

2. Recognizing events and their chronological relationships. 

3. Reading and interpreting graphs, tables, and maps. 

4. Locating and evaluating sources of information.

5. Analyzing social problems. 

This procedure has much to commend it as a way of providing more
information, but some drawbacks exist. For one thing, the number of specific
components for each subject or activity can become very large and hence the
amount of measurement involved becomes overly extensive. Then the job of
evaluating each pupil becomes too time-consuming and complex for the
teacher. Another but smaller difficulty lies in wording each aspect so that
parents and others can understand what is being rated. If these difficulties can
be overcome, the use of a check list of components for each subject has great
potentialities.
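
A check list of this kind is easily represented as a simple record. The sketch
below, in Python, uses the five social studies dimensions listed above; the
function name `analytic_report`, the pupil, and the ratings are invented for
the example, and the layout is only one of many a school might choose.

```python
SOCIAL_STUDIES_DIMENSIONS = [
    "Developing a special vocabulary",
    "Recognizing events and their chronological relationships",
    "Reading and interpreting graphs, tables, and maps",
    "Locating and evaluating sources of information",
    "Analyzing social problems",
]


def analytic_report(pupil, ratings, dimensions=SOCIAL_STUDIES_DIMENSIONS):
    """Return a plain-text report with one qualitative rating per dimension.

    Raises ValueError if any dimension is left unrated, so that an
    incomplete check list is never sent home.
    """
    missing = [d for d in dimensions if d not in ratings]
    if missing:
        raise ValueError(f"no rating for: {missing}")

    width = max(len(d) for d in dimensions)
    lines = [f"Pupil: {pupil}"]
    for dimension in dimensions:
        lines.append(f"{dimension.ljust(width)}  {ratings[dimension]}")
    return "\n".join(lines)


# Hypothetical ratings for one pupil.
ratings = dict(zip(SOCIAL_STUDIES_DIMENSIONS, ["B", "A", "C", "C", "B"]))
print(analytic_report("J. Smith", ratings))
```

The drawback noted above shows up directly in such a record: every added
dimension is one more rating the teacher must measure and enter for every
pupil.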

PARENT-TEACHER CONFERENCES

Another procedure used by schools in order to overcome the limited
communication of a report card is to establish periodic parent-teacher conferences.
This procedure has the advantage of a two-way personal interview in which
the status of the pupils can be discussed informally and completely. A series
of conferences of this sort requires careful scheduling and they are very
expensive of time. Moreover, such conferences require a minimum level of
competence on the part of the teacher if they are to be successful. Experience
with the procedure has indicated that, in general, parents are initially
enthusiastic about the conference but as soon as the newness wears off their
interest decreases and subsequently their attendance drops off. Some schools use
an informal letter to the parent in order to offset some of the difficulties
encountered with the parent-teacher conference; however, this again may make
evaluation an excessive burden to the teacher.

ROLE OF STANDARDIZED TESTS IN EVALUATION

Standardized tests have been particularly useful in helping to relieve the
heavy import carried by school marks. These tests have been especially useful
for predicting future academic success and for placement in courses. Colleges
are more and more relying on a combination of standardized test scores and
high school marks as a basis for determining those who are most likely to
succeed in college, rather than relying solely on high school marks.
Standardized tests are also useful in helping teachers check their grading and
thereby give some consistency to their marks. There is an extreme point of
view, with which the authors disagree, that such tests should be the sole
determinants of final marks in courses. One can readily see how these tests may
supplement the regular evaluation process but it is difficult to see how they
can supplant it.

WHAT THE LETTER MARK MAY REPRESENT

The practice of giving a single mark as a qualitative rating for a course
seems to be deeply rooted in our present-day educational system and there
seems to be no indication that this practice will be drastically changed in the
near future. This raises the question, then, of what the single mark should
represent. In assigning a single mark to a pupil, should the teacher give
consideration to the pupil’s effort, capabilities, initiative, co-operation, and to
the effect it might have on the pupil and his parents? In a recent survey of 53
California school systems of various sizes, 27 school systems reported that
they believed that the basis for awarding subject-area marks should be a
composite of subject matter achievement, co-operation, effort, and
initiative (4).

This is contrary to the principles set forth earlier in this text, that any
symbol is meaningful only when it refers to one dimension. This means that
the single course mark should be devoted only to evaluating status as to
competence in the subject. All other considerations should be rigorously excluded.
This, of course, is much easier said than done, for we teachers are human
and the temptation is strong to consider the effect the marks will have on the
pupils and to use marks for rewards and punishment. In order to reduce this
temptation and also to provide for the evaluation of “citizenship,” some
schools include certain nonacademic dimensions on their report cards and
rate them separately. These include such items as citizenship, work habits,
attendance, co-operation, conduct, effort, responsibility, and attitude.

Criteria for a Marking and Reporting System 

Determination of a thoroughly satisfactory report card and/or marking
system thus seems to lie in the future. However, in our criticism of current
practices there have been implied certain criteria for more valid marking and
reporting. Some of these may be met by carefully handling any conventional
system. To accommodate to others will require procedural innovations,
released teacher time, additional training on the part of teachers, etc.

1. Marks in given subjects should be based on extensive and
comprehensive measurement. A pupil’s achievement in a subject (arithmetic,
reading, biology, or shop) is his performance over a span of time and consists
of many dimensions. To measure it only once (say, a final examination), to
measure only one dimension (say, memory of facts), or to use but one measuring
procedure (say, a guided response test) is to assume that any pupil
performance is clear evidence of all the pupil’s performances. This, of course,
is known to be a false assumption so, to be valid, marking must be based on
extensive and varied measurement of all essential aspects of the achievement in
question.

2. The marks must symbolize a comparison between pupil status and
known and fair standards. As we have emphasized throughout this chapter,
the sine qua non of effective evaluation is the evaluative standard. Without
one, evaluation must be subjective, capricious, and erratic. For the teacher to
have a standard, though, is not enough for pupils and parents. They wish to
know what it is and to know that it is fair. Discussion with pupils may serve
to inform them but parents should have the standard(s) in writing. This may
be on the face of the report card or on a separate form, but it should
accompany transmission of grades. Fairness in a standard hardly needs justification
but in gaining it two considerations are critical. As we have asserted, the
maturation of pupils in the grade in question must be a limiting factor in
setting standards. Moreover, standards should be derived from the
performance of pupils rather than from the performance of the teacher. The latter’s
performance almost by definition is sure to outstrip that of the best pupils,
and consequently, if it is used as a rigid standard, even the best pupils may
receive unduly low marks.

3. The marking and reporting system should be somewhat diagnostic.
Since learning is best promoted by detailed knowledge of right and wrong
actions, generalized evaluations do little to promote it. The basic components
or dimensions of performance in English, science, history, etc., must be
marked separately if the pupil is to know what to exploit, what to forget, and
what to correct. Earlier, on pages 29-32, 93, we cited several examples of
subject breakdown. In the next section of the book, the necessary analysis is
described for most school subjects.

4. Marks and reports must help the pupil assess his accomplishments
realistically. Much has been said about assuring success for pupils and
minimizing frustration for them. With this stand we agree, but we think that great
care should be taken in extending the principle to marking practice. For a
student to receive a mark that symbolizes more accomplishment than he has
actually achieved is, we feel, a disservice to the child. He will tend to overrate
his competence and thus may be sharply disappointed when he tries it out in
a less benevolent circumstance. On the other hand, good pupils who are
marked low simply to prod them to more accomplishment may for that reason
underrate their talent and fail to aspire as high as they should.

We consider that the primary function of marks is to communicate a
realistic evaluation. In doing this it may be appropriate to mark both on the
basis of achievement and on gain or learning. It seems that for most subjects
pupils profit from knowing how much they can do at the moment and how
much progress they have made, so a separate mark might well be given for
each type of evaluation. Certainly, the pupil and parent should not be left to
guess which is the basis nor should some pupils in the same class be marked
on gain and others on level of achievement.

The frequent practice of differential marking according to ability, if not
carefully administered, can produce unrealistic self-evaluations on the part
of pupils. In this procedure each pupil is judged according to how much may
be expected of him. If the dull pupil lives up to expectation, he receives the
same mark as the bright pupil who lives up to expectation. Yet, by definition,
the pupils' achievements are entirely different. Some of the misleading aspect
of “marking on ability” can be avoided by adding subscripts to the grades,
i.e., x for bright pupils, y for average, and z for dull ones, or some such.

The primary virtue of ability marking is its avoidance of competition
among ill-matched pupils. We feel that the use of dual marks, one for absolute
achievement and one for gain, will serve to minimize this unfair competitive
aspect of marking and still be realistic. The poorest pupil feels “success” over
his progress but still recognizes the low level of his competence. The brightest
pupil, on the other hand, can be reprimanded by an F for his lack of progress
but still be told by an A that his achievement is nonetheless of high order.
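
The dual-mark proposal can be sketched briefly in Python. Everything numerical
here is an assumption made for illustration: scores are taken to lie on a
100-point scale, and both sets of cutoffs are hypothetical. In practice each
mark would be tied to an explicit evaluative standard of the kind described
earlier in this chapter.

```python
from dataclasses import dataclass

# Hypothetical cutoffs: (minimum value, letter), highest first.
ACHIEVEMENT_CUTOFFS = [(90, "A"), (80, "B"), (70, "C"), (60, "D")]
GAIN_CUTOFFS = [(20, "A"), (15, "B"), (10, "C"), (5, "D")]


def letter(value, cutoffs):
    """Return the first letter whose minimum is met, or F if none is."""
    for minimum, mark in cutoffs:
        if value >= minimum:
            return mark
    return "F"


@dataclass(frozen=True)
class DualMark:
    achievement: str  # where the pupil stands now
    gain: str         # how far he has come during the term


def dual_mark(start_score, end_score):
    """One mark for absolute achievement, a separate one for gain."""
    return DualMark(
        achievement=letter(end_score, ACHIEVEMENT_CUTOFFS),
        gain=letter(end_score - start_score, GAIN_CUTOFFS),
    )


# A bright pupil who coasted, and a weak pupil who progressed a good deal.
print(dual_mark(93, 94))
print(dual_mark(30, 52))
```

Keeping the two marks in separate fields, rather than blending them, follows
the principle stated earlier that a symbol is meaningful only when it refers
to one dimension.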





5. The symbols and the report form must be understandable to pupils and
parents. Obviously the meaning of marks should be clear and constant from
year to year and from teacher to teacher. Such technical terms as percentile
and standard score should be avoided. All entries on the forms used must be
immediately meaningful or there defined.

6. Finally, the preparation and administration of report cards or the
alternatives of letters and conferences should entail no more time than is
available for the purpose. Teachers should not be expected to work overtime to
achieve effective reporting. Even if they are willing to do so, their efficiency
will drop and much of the value of the procedure will be vitiated. If school
boards and administrators cannot provide time enough in the regular school day
for teachers to adhere to some of these criteria that involve additional time, it
should be anticipated that marking and reporting simply will fall short of the
criteria.

Summary 

Evaluation is the process whereby the quality or value of anything is
determined. Usually, this involves a comparison of the status of the thing in
question with a standard appropriate to a determination of its value. Among the
standards in use in schools are purposes, averages, arbitrary conceptions, and
customs. A performance scale is considered the best type of standard for
evaluating achievement.

To be satisfactory, evaluative standards should:

1. Present a clear-cut variation of quality or value with different
categories of value well defined;

2. Be practical for use and expressed in terms appropriate and favorable
to the best procedures of measurement;

3. Be consistent with scientific findings about learning and child
development; and

4. Be consistent with the values of our culture.

Evaluations are expressed in much the same forms as measurement,
namely, by classificatory and descriptive symbols, by indexes of rank, and by
scale numbers. The last type of evaluative symbol is rarely possible in
education. Whatever the form of expression, it is essential that the symbol be
related to unambiguous and appropriate statements about the quality gradations
they represent.

To accomplish valid evaluation of pupil achievement it is necessary first
to determine what dimensions of achievement are to be judged. Second, a
standard should be selected or prepared appropriate to the dimensions. In the
third step, needed measuring devices should be procured or prepared and
designed according to the levels contained in the evaluative standard. Finally,
measurement must be performed and the evaluations made and communicated
to pupils and others who need to know them.





A generalized standard proposed for use in evaluating pupil understanding
of any subject is composed of five levels of performance: I. imitating,
duplicating, repeating; II. Level I plus recognizing, identifying, remembering,
recalling, classifying; III. Levels I and II plus comparing, relating,
discriminating, reformulating, illustrating; IV. Levels I-III plus explaining,
justifying, predicting, estimating, interpreting, drawing inferences; and
V. Levels I-IV plus creating, discovering, reorganizing, formulating new
questions, hypotheses, and problems.

School marks, on tests, papers, homework, and the semester’s work, are
symbols of teachers’ evaluations of pupil achievement. As such, they serve
to facilitate instruction and guidance, motivate study, serve as a basis for
future planning, for placement, promotion, and admission, and for prognosis
of school and vocational success. Traditional practices in marking have many
weaknesses but no entirely satisfactory new method has yet been devised.
Among the conditions essential for a valid marking and reporting system are
these items. Marks should be based on extensive and comprehensive
measurement. They must symbolize a comparison between pupil status and known
and fair standards. The system should be somewhat diagnostic. It should help
the pupil assess his accomplishments realistically. The meaning of symbols
must be clear to pupils and parents. Finally, the marking and reporting process
should put no excessive burden on the teacher.


EXERCISES

1. Compare the hierarchy of performance levels as presented by Bloom (see
bibliography) with the hierarchy as suggested in this text.

2. Prepare a generalized set of evaluative standards in the form of successive
levels of performance that would be generally applicable to your teaching area.

3. Select a specific objective in your teaching area, develop a set of evaluative
standards for this objective, and construct a few illustrative items that would be
contained in your measuring instrument.

4. Examine carefully at least three report cards that are in use (you may
refer to the report cards presented in Appendix C if you are unable to obtain your
own copies). Make a critical analysis of these report cards in the light of the
discussion of this text.

5. Develop what you believe to be an ideal report card for a course you teach
or plan to teach.

6. Suppose you have been selected to serve on a committee that plans to study
the marking practices in your school. What recommendations would you make to
insure uniform marking throughout your school?




SECTION II 


CUSTOMARY USES OF 
MEASUREMENT AND EVALUATION 
IN EDUCATION 

OVERVIEW 

The general processes of behavioral measurement and some problems
involved therein have been presented in the first section of this text. In addition
some attention has been given to the evaluation of pupil achievement and to
effective modes of reporting evaluations to pupils and parents. Our purpose
has been to develop a generalized scheme for measurement and evaluation
that is valid and will make for more efficient evaluative practices in schools.
Now we wish to apply this scheme to the tasks of educational measurement
and evaluation that teachers and administrators must perform.

Never before in history have schoolmen been so interested in accurate
assessments of achievement and the factors related to it: intelligence,
personality adjustment, reading. In the complex origins of this circumstance, two
factors are paramount. In the first place, educational practice has at last been
given some scientific underpinning. Out of the laboratories of Wundt, Watson,
Thorndike, Skinner, Yerkes, and others have come experimental findings
about learning that support a demonstrably effective methodology of
instruction. As part of this scientific movement in education, both the factors that
affect learning in the schools and the outcomes of instruction have been
expected to yield up their secrets to the probing of measurement. Moreover, the
analogy of the laboratory and the classroom, of research and of everyday
instruction, has largely been accepted. If precise measurement has been applied
to motivation, practice, error, learning, and retention by the researcher, should
it not also be applied by the teacher?

As we know, of course, the great promise of objective measurement in
education has not been entirely fulfilled. The tests have not been completely
reliable nor has their focus been exact. Teachers and administrators have
lacked the time and skill to do all the things the psychometrists have said must
be done. And, unfortunately, many of the testing practices of the 1910's and
1920's have been found to rest on erroneous statistical and psychological
assumptions. However, to offset these misadventures, there has been continued
refinement of testing procedures and instruments. With decreased dependence
on the “objective” test (guided response) has come increased efficiency in the
use of essay examinations (free response). And a great variety of novel
measuring techniques has been developed in pace with the school's widening interest
in understanding, critical thinking, socially desirable attitudes, and the like.

Of equal force, perhaps, in increasing the importance of measurement in
the schools are two relatively new aspects of American public schools,
guidance and ability grouping. The counselor requires IQ's, reading grade
placements, percentile ranks on interest tests, and cumulative records if he is to
guide his students toward proper vocational goals and to help them with their 
problems of personal adjustment. For efficient ability grouping, whether within 
a class or among classes, it is necessary to have reliable information about 
the general ability and the subject readiness of each pupil. 

If the measurement and evaluation schoolmen now must perform is to be
efficient, it is thought that it should be consonant with some rational and
systematic approach. Testing and marking is now too complex, too ramified, and
too important in the lives of pupils for it to be practiced casually. The scheme
presented in the first section is designed to produce more efficiency and validity
in the evaluation of pupils in various subjects. It has been developed in full
knowledge of the many problems involved in marking assignments, scoring
tests, observing laboratory performance, and giving grades. It offers no pat
solutions or panaceas for these problems, but it does offer a rational approach
to their solution.

In this section attention will be given in order to the Language Arts and
its many specific subjects, to Social Studies, to Science and Mathematics, and
then to the performance-activity subjects of Music, Art, Physical Education,
etc. For each of these subjects, measurable dimensions are discussed,
appropriate measuring forms and procedures are described and illustrated,
evaluative standards are presented, and marking practices are noted. The
measurement of intelligence is examined subsequent to this, as is measurement applied
to personality and character. The section and the book conclude with a
discussion of school-wide testing programs.

No attempt has been made in the section to deal with every possible aspect 
of measurement and evaluation for every school subject. To some extent, our 
coverage of subjects is a function of the amount of research that has been 
devoted to their measurement. In addition, we have tried to consider the usual 
training programs for different types of teachers. Thus we have dealt less
exhaustively with the highly specialized subjects (art, music, physical education,
etc.) than with the more general ones, since the training of teachers for these
special subjects usually involves specific instruction in measurement and
evaluation as applied to the subjects. Then for each subject area we have tried to
emphasize the procedures most significant for the area. Finally, we have not
attempted to repeat discussions of procedure and principle where their
discussion in one context may be easily understood and applied to another.

In view of this, it may be advisable to study all of certain chapters and
portions of others in addition to the one representing your specialization. To
help you determine what you will want to read, we have outlined below the
special emphasis in each of the chapters on subject areas and portions that
may have general significance. The three remaining chapters (Intelligence,
Personality and Character, and School-wide Testing Programs) are, of course,
pertinent to teaching at any level or in any subject.


Chapter 10. Language Arts
    Special Emphases
        Guided response techniques
        Standardized tests
    Passages of General Significance
        Reading, pages 222-2??
        Composition, pages 2??-250

Chapter 11. Social Studies
    Special Emphases
        Free response techniques
        Analysis of written products
    Passages of General Significance
        Evaluative standards, pages 272-274
        Citizenship and psychological grading, pages 275, 276
        The prototype study of all phases of measurement and evaluation in an
        eighth grade class, pages 278-291, particularly the over-all plan,
        page 282, and attention to study habits, pages 286-289

Chapter 12. Science and Mathematics
    Special Emphases
        Guided response techniques
        Performance scale evaluative standards
    Passages of General Significance
        Evaluative standards, pages 3??, 313-319

Chapter 13. Performance-Activity Subjects
    Special Emphases
        Observation techniques and analysis of graphic products and artifacts
        Performance tests and performance rating scales
    Passages of General Significance
        Performance tests, pages 3?9-3?4




CHAPTER 10 


LANGUAGE ARTS 


Language, in its various aspects, is the modus operandi for learning in
the schools. Pupils and teachers talk and listen, books are read and papers
written with a view to learning given things: arithmetic, history, governmental
forms, social problems, science. In addition, the outcomes in these subjects,
the things learned, are themselves language in many cases: the basic symbols
or vocabulary of a subject, concepts, statements of fact and relationship,
theories, etc. Moreover, the skills of language are objects of instruction in the
elementary grades. In secondary grades, to these are added the artifacts of
language: novels, poems, essays, and short stories; plays, news stories,
biographies, and propaganda.

In consequence of this fact that our schools are primarily language schools,
in one form or another, the measurement and evaluation of language arts
phenomena is a matter of great concern in all school grades. In the lower
grades, where the development of rudimentary language skills is of critical
importance, primary attention is given to reading, handwriting, and
vocabulary. In the upper grades the measurement of these phenomena per se is less
significant and spelling, punctuation, grammar, and composition are the
phenomena more often evaluated. Then, in secondary school and in college,
specialized aspects of language become the major objectives of evaluation:
public speaking, literature, creative writing, foreign languages, dramatics, and
journalism.

Our plan of presentation in this chapter is first to describe some general
features of measurement and evaluation in language arts and then to discuss
their application to each language skill or subject. In both the general and
particular treatments, the organization of the first section of the book will be
followed. The dimensions of the phenomena will be presented, then the forms
and procedures of measurement that are most relevant. After this, attention
will be given to applicable evaluative standards and finally to problems of
marking and reporting.




GENERAL FEATURES 


Measurable Dimensions 

Each of the language arts subjects has several measurable dimensions.
Collectively, these have but two common characteristics: they are largely
symbolic behaviors and, for the most part, they are overt behaviors. Because
they are overt, the problems of inferred dimensions and constructs are rarely
encountered (see pages 26-28). Because they are symbolic, the matter of clear
definition is perhaps the most critical condition to be met in preparing to
measure them.

When an attempt is made to measure language arts achievement as a
whole, the elements of language arts are themselves the basic dimensions for
which test scores or ratings are sought. Reading, spelling, vocabulary,
grammar, and knowledge of literature, several or all, are the focus of general
language achievement tests. A pupil's status with respect to the particular
dimensions of each element usually is not assessed separately.

Forms of Measurement 

While some of the various language arts entail special measurement
expressions, nearly all are limited to the precision of classification and rank
symbols. In some cases scores from teacher-devised tests may be converted
to percentile ranks, and nearly all standardized tests yield percentile ranks or
some other index of rank in a large population.
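
The conversion of teacher-devised test scores to percentile ranks can be sketched as follows. This is only an illustration under stated assumptions: the function name `percentile_ranks` and the sample scores are hypothetical, and the midpoint convention used here (percent of scores below, plus half the percent of scores tied) is one common choice among several, not a method prescribed by this text.

```python
from bisect import bisect_left, bisect_right


def percentile_ranks(scores):
    """Return the percentile rank of each raw score within its own group.

    Assumed convention: percent of scores below a given score plus half
    the percent of scores equal to it (the "midpoint" percentile rank).
    """
    ordered = sorted(scores)
    n = len(ordered)
    if n == 0:
        return []
    ranks = []
    for s in scores:
        below = bisect_left(ordered, s)           # scores strictly lower
        equal = bisect_right(ordered, s) - below  # scores tied with s
        ranks.append(100.0 * (below + 0.5 * equal) / n)
    return ranks


# Hypothetical raw scores for a class of five pupils.
class_scores = [12, 15, 15, 18, 20]
print(percentile_ranks(class_scores))
```

A rank obtained this way places a pupil only within his own class; the percentile ranks of standardized tests refer instead to the larger population on which the test was normed.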

Even though the differences among pupils in most of the dimensions of
language arts are a matter of degree or of continuous variation, scale
measurements are hardly applicable to them. Scale measures require fixed points of
reference and regularized units of difference, and these are hard to contrive for
language arts. It is true that so-called “scale” scores are produced by some
tests and many rating “scales” for language arts are in existence. However, the
use of the term usually is figurative rather than literal. A pupil's rate of
reading may, of course, be expressed in words-per-minute, a scale index, as can his
rate of writing or talking.

Procedures of Measurement 

Appraisal of progress in language arts subjects, more often than not, is
based on informal and even incidental observation and product analysis. The
marks that teachers give pupils are heavily influenced by what they see and hear
as pupils express their ideas in discussion, study, write papers, and use the
library. The pupils' paragraphs and compositions, their fill-ins of work sheet
pages, and their notebooks are inspected by teachers as a further basis for
marking. Thus, the unreliability of subjective judgments is a problem in
language arts evaluation. But, on the other hand, the inherent validity of direct
observation and product analysis is a compensating boon.





Both free and guided response tests are used widely, particularly for 
aptitude, instructional classification, diagnosis of difficulty, and for other 
purposes that do not involve marks and report cards. In addition to his own 
tests, a teacher of a language arts subject has available a large number of 
published standardized tests. Table 9 shows the great number listed in the 
Fourth Mental Measurements Yearbook (1).

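TABLE 9

Published Tests Listed for Language Arts
Areas in the Fourth Mental Measurements Yearbook

English in general . . . . . .  30
Composition  . . . . . . . . .   2
Literature . . . . . . . . . .  18
Spelling . . . . . . . . . . .  15
Vocabulary . . . . . . . . . .   6
Reading  . . . . . . . . . . .  36
Oral . . . . . . . . . . . . .   1
Readiness  . . . . . . . . . .   7
Special fields . . . . . . . .   5
Study skills . . . . . . . . .  11

Total  . . . . . . . . . . . . 131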


The general advantages and disadvantages we noted for standardized tests
(page 123) hold true in language arts. Their use permits comparisons with a
larger population of pupils than locally devised tests and their construction is
likely to be superior; but they may not suit either the objectives or content
of a particular class and they often connote more “scientific” measurement
than they attain. As we shall see in succeeding sections, standardized tests
are far more valid for some language arts subjects than others. In reading,
for example, they are well-nigh indispensable but in composition and literature
their worth is often questionable.

Evaluative Standards, Marks, and Reporting 

In one way, evaluation in language arts subjects may be fairly business- 
like and objective. Dictionaries and style books have recorded accepted word 
meanings, spellings, punctuation and, to some extent, usage. The teacher may 
judge papers submitted by pupils adversely as they differ from these standards 
and commend them as they approximate the standards. In another way, 
though, the evaluation of language arts phenomena may be vague, subjective,
and even arbitrary. The other standards by which pupils' reading, writing, and
talking are judged are such things as ability level expectancies, course of
study objectives and standards, and teachers' opinions of what is “proper” for
a fifth grader or a twelfth grader. These standards often are ill-defined; they 
vary from region to region and school to school; they may not be written
down; and, in many instances, they derive from custom or precedent rather
than utility.

Marks rarely are given for language arts as a subject of study (it is a 
collective term) but rather for reading, spelling, English, journalism, etc. Prac- 
tices of marking in language arts subjects display all the variety and common 
tendencies we noted for marking in general in the previous chapter.

As in other broad subject areas, many school districts have formulated 
evaluative standards, policies, and techniques for district-wide and multigrade
use. These activities are considered extremely valuable, even though the
evaluation plans achieved may have technical flaws. Only through such broad
and cross-grade planning can any sort of comparable and accumulative
evaluations be attained for the pupils who go through the school system. If
evaluations are not comparable from grade to grade and are not cumulative, pupils
are bewildered and teachers lack essential information. 


READING¹


Reading Readiness 

Even before children open the first preprimer in the first semester of the
first grade, measurement usually has begun in reading. It has been known
for a long time that a given level of maturity must be attained by a child
before he is capable of learning to read. While important variation is to be
noted in individual cases, by and large a child is ready to profit from
instruction in reading only when he has the mental age of six years, six months, and
possesses the social, emotional, physical, and experiential attributes that tend
to accompany this degree of mental development in middle-class native-born
American children. Consequently, perhaps the first measurement task that
confronts a first-grade teacher² is to gauge her pupils' readiness for reading
according to this general criterion.

DIMENSIONS OF READING READINESS

No given list of dimensions of reading readiness has yet been agreed to
by authorities in reading. Mental maturity, visual and auditory acuity,
competence in oral language, and experience relative to the objects and ideas in
first readers seem to be the most commonly cited factors. The many published
tests of reading readiness differ among themselves as to the dimensions they
attempt to measure. As a rule, the tests are directed toward certain factors
that also are measured by intelligence tests but add others that relate
particularly to reading. The dimensions that two rather different types of readiness
test purport to measure are listed below.³

Metropolitan
    Word meaning (oral)
    Sentence meaning (oral)
    Information
    Visual discrimination of object likenesses and differences
    Knowledge of number
    Ability to copy

Murphy-Durrell
    Auditory discrimination (likenesses and differences of letters and words)
    Visual discrimination (likenesses and differences of letters and words)
    Rate of learning of vocabulary

¹ Measurement of reading status, aptitude, and weaknesses is intimately related to
instruction in reading. The many current textbooks on reading methodology contain
extensive material on measurement and should be consulted by any person who wishes
to make a thorough study of measurement and evaluation in reading. Moreover, teachers'
manuals that accompany a series of readers usually suggest appropriate evaluative
devices and may even present ready-to-go tests.

² That this section is presented from the point of view of a first-grade teacher does
not mean that readiness is of no concern in later grades. For any child who has yet to
read, readiness should be gauged at whatever grade.

FORMS AND PROCEDURES OF MEASUREMENT FOR READING READINESS

By definition, reading readiness is a composite of many variables, both 
immediate and historical; and status relative to each of them usually is ex- 
pressed in a different form. For mental maturity it is customary to use Mental 
Age, a rank symbol. Visual and auditory acuity may be expressed in Snellen 
chart ratios and in watch tick or numbers heard distances. They can be, but 
usually are not for school purposes, expressed by the precise scale numbers
of oculist's instruments and an audiometer test. Oral language ability usually
is merely described in words, classified, or, at best, reflected in the percentile
scores of a readiness test. Experience is necessarily gauged by verbal
description and by rough classifications.

Because of a lack of uniform indexes of measurement for the several
variables of readiness, no single measure may be assigned to a pupil's
readiness as a whole. Hence, the measurement of reading readiness is largely a
diagnostic procedure. It is like a physical examination or a case study. It is
not like determining a child's weight or even his intelligence or facility at
spelling. While, as we shall see, reading readiness tests may yield single scores,
they do not by any means appraise all the important dimensions of readiness.

The procedures by which this diagnosis is to be accomplished are as varied 
as the forms of measurement to be applied to its components. 

Mental Maturity. The best measurement of mental maturity is, of course,
a carefully administered intelligence test (see Chapter 14 and Appendix B).
A group test should suffice to detect those who have the requisite Mental Age
but individual tests should be ordered to verify findings for pupils with Mental
Ages apparently too low for reading. If intelligence tests cannot be used for
some reason, recourse is necessary to the first-grade teacher's observation of
the pupil during the first weeks of school and possibly parent conferences and
neighborhood visits. While this latter means can yield no number score, it can
yield a verbal estimate of the child's brightness relative to members of his
peer group. With experienced teachers, estimates of relative intelligence can
have surprising accuracy.

³ These tests along with others are described in Appendix B.

Sensory Acuity. The time-honored test of visual acuity is the Snellen or
similar vision chart. However, many children will have 20/20 vision so far as
the chart is concerned, but still are not able to focus properly on reading
material 18 inches from their eyes. The chart test measures visual acuity at a
distance while reading requires acuity of near vision. Children, as we know,
do not develop the ocular adaptation for fixating clearly on small objects close
at hand until about the time they enter school. Thus, the ordinary Snellen
chart eye examination should be supplemented by an oculist's examination for
any child with normal distance vision who has any difficulty with seat work.
It presently is customary and advisable to refer to an oculist any child whose
distance vision is appreciably less than normal.

Again, for hearing, there is a traditional test, the watch tick or whispered
numbers. But as with vision, the gross results of the traditional procedure
often fail to show the differential aspects of hearing that are important for
speech. Consequently, use of a variable frequency audiometer is desirable.
This instrument broadcasts tones whose pitch, as well as amplitude, may be
controlled by the examiner. With it a child's ability to hear may be determined
for all audible frequencies. School nurses are being trained today in the use
of the audiometer, as are school psychometrists and speech clinicians.⁴

Oral Language. The third general dimension of concern in reading readi- 
ness, oral language ability, may be appraised in several ways. Many reading 
readiness tests contain vocabulary and sentence-meaning sections. Casual 
listening to pupils’ conversation at recess and during activity periods usually 
will serve to detect those whose language development is accelerated and, as 
well, those who may be retarded in speech. For the latter, more systematic and 
detailed observation is in order. Storytelling sessions are excellent devices for 
revealing pupils' oral language facility and they serve other instructional
purposes as well. A more contrived but fairly specific procedure is to ask pupils
to tell what certain words mean. The words can be selected from primers or 
from word lists and thus a child’s oral familiarity with the words he must 
learn to read can be determined rather precisely. 

Experience. A child's general experience with adults, peers, and books
has a large bearing on his readiness for reading. Children who have been
talked to and who have talked much themselves, who have had stories read
to them, who have possessed picture books, and who have seen their parents
read for pleasure tend to have an initial advantage over those who have had
less of these experiences. Conferences with parents or home visits can provide
some information. In addition, the pupils themselves will talk about their
experience at home if given the chance. In using any pupil reports, careful
attention must be given to the six-year-old's penchant for inaccuracy,
exaggeration, and sheer invention.

⁴ Extensive discussion of visual, auditory, and other sensory measurement is given
in school health textbooks.

Readiness Tests as a Measuring Procedure. Reading readiness tests are
the only systematic procedure many teachers use to measure readiness. Such
tests can be effective in appraising phases of a child's language development,
his information, and certain aspects of his mental development; and, as we
have seen, these are dimensions of reading readiness. However, readiness tests
are tests. For most pupils they may be the first experience with a test and
the strangeness alone may depress many pupils' scores. They require clear
understanding of and strict adherence to directions, and a child who is ready
to learn to read still may be unable to understand or to follow complicated
directions the first time he is exposed to them. The tests are necessarily short
and thus their sampling of dimensions is limited. Therefore, primary teachers
are urged not to depend on a readiness test score as the only index of
readiness.⁵

Used along with other procedures, readiness tests are an invaluable aid 
to the first-grade teacher in gauging the readiness of pupils for reading. Many 
reputable ones are on the market. They are relatively inexpensive, easy to
administer, easy to score, and manuals explain the significance of raw scores 
and usually give tables for converting them to percentiles or other derived 
scores. A number of items from two tests are presented in Figure 40 to illus- 
trate the nature of readiness tests. 

As a practical consideration, a school or school district may need to choose
between administering an intelligence test or a readiness test in the first grade.
While no hard and fast rule seems advisable, it is thought that the intelligence
test should be given priority because it has significance for all aspects of the
first-grade program. Moreover, scores on mental tests used alone are about
as predictive of success in reading as are scores on readiness tests used alone.
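
How predictive a set of scores is of later success is ordinarily expressed as a coefficient of correlation between the scores and some later criterion, such as end-of-year reading marks. A minimal sketch of the product-moment (Pearson) coefficient follows; the function name `pearson_r` and the five pairs of figures are made up for illustration and are not data from any study cited in this text.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and of squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one list does not vary")
    return sxy / sqrt(sxx * syy)


# Hypothetical figures: readiness test ranks and end-of-year reading marks.
readiness = [1, 2, 3, 4, 5]
marks = [2, 1, 4, 3, 5]
print(round(pearson_r(readiness, marks), 2))  # 0.8 for these made-up figures
```

A coefficient computed on so few pupils would, of course, be far too unstable for any practical decision; the sketch shows only the arithmetic behind statements of predictiveness.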

STANDARDS OF READING READINESS

At the outset of this section on measuring readiness, it was stated that a
child with a mental age of six years, six months, and with all the normal
concomitants of this level of mental development, was considered to be ready
to learn to read. In terms of the dimensions of readiness we have just been
discussing, “normal concomitants” mean the sensory development, language
development, and experience typical for children whose mental age is six
years and six months. By implication, children who are at this level learn to
read with average ease; superior ones learn to read with greater ease and
speed; and those below the norms have more difficulty or, if they deviate too
greatly, are entirely unable to learn to read.

⁵ Designers of readiness tests standardize them on primary pupils and thus norms
are somewhat adjusted to the presence of just such variables. This, however, assures
their validity more for groups than for individuals and it does not increase their
reliability. It is interesting to note that readiness tests and teachers' forecasts of readiness
(based, presumably, upon casual observation of and reflection upon many of the
dimensions of reading readiness) may not be too closely correlated and hence neither are
infallible predictors of success in reading. For example, one researcher, Henig (14),
found the equivalent of a coefficient of correlation of .60 between teachers' ratings and
scores on the Lee-Clark readiness test administered to 98 pupils. The test bore a
correlation of .55 with marks given in reading at the end of the year, while teachers' forecasts
showed a coefficient of .59 with end-of-the-year marks.

Teacher says, “Mark the hatchet.”

Teacher says, “Draw a cross on the right picture. In the Fall, father rakes the
leaves and burns the leaves.”

Teacher says, “Mark the one to carry in the parade on the Fourth of July.”

Teacher says, “Draw a frame around the picture that is just like the one in
the middle.”

☆ ☆ ☆ ☆ ☆ ☆
Teacher says, “See the stars. Mark four of the stars.”

Teacher says, “Draw another picture just like the one you see.”

no    in    nip    i    on    imp
Teacher says, “Draw a ring around the word that is like the one on this
card.” (Teacher holds up a flash card.)

Figure 40. Sample items from readiness tests. (The first six are from the Metropolitan
Readiness Tests, Form R, Kg-1, 1949; the last from the Murphy-Durrell Diagnostic
Reading Readiness Test, Primary, 1947; both copyright by World Book Co., Yonkers,
the publishers, and reproduced by special permission.)

The following outline presents the normal conditions of reading readiness.
Expanded descriptions of typical development for children with MA's of
six-six are to be found in the behavior profiles of child development
textbooks (see Gesell and Ilg (9) in particular).

Mental development:    MA of 6-6. Scores on language portions of mental
                       test no less than this.

Vision and hearing:    Near-vision normal for chronological age of 6-6 or
                       corrected; no uncorrected astigmatism; history of
                       normal hearing at all voice frequencies.

Language development:  At 50th percentile or better on vocabulary and
                       sentence parts of standardized readiness test; speech
                       at or above median for unselected entering first
                       graders (social and interest aspects as well as
                       vocabulary and usage).

Experience:            History of story reading and book familiarity;
                       experience with the ideas and objects to be encountered
                       in reading program; experience in taking turns,
                       drawing, relating anecdotes.

Readiness tests:       50th percentile or other score which test norms say
                       is minimal for first efforts at reading.
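
Because readiness is a composite to be diagnosed rather than a single score, the outline above lends itself to a dimension-by-dimension check. The sketch below is one hypothetical way of recording such a check. The record fields, the function name `dimensions_below_norm`, and the reduction of vision, hearing, and experience to yes-or-no judgments by the teacher are all assumptions made for illustration; only the norm of MA 6-6 and the 50th percentile come from the outline.

```python
from dataclasses import dataclass

# Norms taken from the outline: mental age of six years, six months,
# and the 50th percentile on language and readiness measures.
NORM_MENTAL_AGE_MONTHS = 6 * 12 + 6
NORM_PERCENTILE = 50


@dataclass
class ReadinessRecord:
    """Hypothetical record of one pupil's status on each dimension."""
    mental_age_months: int
    near_vision_ok: bool        # normal or corrected, per examination
    hearing_ok: bool            # normal at voice frequencies
    language_percentile: float  # language parts of a readiness test
    experience_ok: bool         # teacher's judgment from conferences, etc.
    readiness_test_percentile: float


def dimensions_below_norm(record):
    """List the dimensions on which the pupil falls short of the outline."""
    checks = [
        ("mental development", record.mental_age_months >= NORM_MENTAL_AGE_MONTHS),
        ("vision", record.near_vision_ok),
        ("hearing", record.hearing_ok),
        ("language development", record.language_percentile >= NORM_PERCENTILE),
        ("experience", record.experience_ok),
        ("readiness tests", record.readiness_test_percentile >= NORM_PERCENTILE),
    ]
    return [name for name, met in checks if not met]


pupil = ReadinessRecord(72, True, True, 40.0, True, 55.0)
print(dimensions_below_norm(pupil))
```

The result is deliberately a list of dimensions needing attention and not a total, in keeping with the point that no single measure may be assigned to a pupil's readiness as a whole.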





Reading Proper 

A child’s IQ is usually the measure of greatest importance for determining 
the educational program that is offered him. Next to this in significance, and 
surpassing it in some situations, are measures of his reading ability. His place- 
ment in reading groups, the pace and extent of instruction in other subjects, 
and even the elementary teacher's over-all opinion of him as a pupil all are
a function of his measured skill in reading. At the secondary level, measures
of the pupil's reading ability frequently are used to determine the English
section he should be assigned. If differentiated grouping is practiced in other
subjects, it may be used along with the IQ to determine if he is an “X, Y, or Z”
pupil.

DIMENSIONS OF READING

The dimensions of concern in the measurement of reading are few or
many, depending upon the purpose of the measurer. For a quick general
appraisal, it probably is sufficient to use rate and general comprehension.
Rate, of course, is a simple and clearly measurable dimension: the speed at
which an individual reads expressed, as a rule, in words-per-minute.
Comprehension is more complex but, in most cases, it means the accuracy with
which a pupil can recall the details of what he has read. Sometimes, to this
is added his understanding of the general idea or purport of a passage.
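
The two indexes of a quick general appraisal reduce to simple arithmetic. In the sketch below, the function names and the sample figures are hypothetical, and expressing comprehension as the percent of detail questions answered correctly is an assumption about how "accuracy of recall" might be scored, not a procedure laid down in this text.

```python
def words_per_minute(words_read, seconds):
    """Reading rate: words read divided by elapsed time in minutes."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return words_read * 60.0 / seconds


def comprehension_accuracy(correct, asked):
    """Percent of detail questions on the passage answered correctly."""
    if asked <= 0 or not 0 <= correct <= asked:
        raise ValueError("need 0 <= correct <= asked and asked > 0")
    return 100.0 * correct / asked


# Hypothetical pupil: a 450-word passage read in 3 minutes,
# followed by 10 questions on details, 8 answered correctly.
print(words_per_minute(450, 180))     # 150.0
print(comprehension_accuracy(8, 10))  # 80.0
```

The two figures are best reported together, since a rate is meaningful only when comprehension is adequate.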

When reading is to be appraised for given subjects and activities, when
evaluations are to be made of individuals rather than groups and, in
particular, when a pupil's difficulties with reading are to be diagnosed, other
dimensions need to be added. Moreover, those of comprehension and rate need to
be broken down into more specific components. In the following outline, an
attempt is made to list many of the dimensions of reading that are susceptible
to measurement. Some of them are the focus of given standardized tests.
Others pertain to certain school grades and measurement purposes. It is
unlikely that a teacher will be concerned with all of them in any one situation.

Some Dimensions of Reading

Mechanical

    Eye movements
        Number per line, time and place of fixations, number of regressions;
        more a laboratory than a school dimension.

    Voice and hand accompaniments
        Extent of whispering, lip movements, finger reading; of concern in
        primary grades.

    Postural accompaniments
        Nature and extent of maladjustive postures; of special concern in
        elementary grades.

    Recognition techniques
        Identity of and accuracy of, as, sounding, context, configuration, and
        syllabification; of concern in elementary grades and in diagnosis of
        retarded readers.

Vocabulary

    Oral
        Identity and number of words pupil understands when spoken; of concern
        in primary grades and in diagnosis.

    Sight
        Identity and number of words pupil can pronounce and understand when
        seen; of concern in primary grades and in diagnosis.

    General
        Identity and number of words known whether spoken or seen; of concern
        at all grades but particularly in secondary and college.

Rate
        Words read per minute assuming good comprehension; should be
        differentiated according to purpose and material; of concern at all
        levels.

Comprehension

    Sentence
        Length and difficulty of sentences pupil can understand; accuracy of
        same; of concern at all levels.

    Paragraph
        Length, difficulty, and type of paragraphs pupil can understand;
        accuracy of same; of concern at all levels.

    Passage
        Difficulty of passages pupil can understand; accuracy of same without
        regard to length; of concern at all levels.

Oral reading
        Rate and comprehension; in addition, accuracy of pronunciation and word
        emphasis; appropriateness of tone, rate, and inflection according to
        nature of material; way of sounding unknown words; of concern in
        primary grades and at all levels for public speaking and dramatics.

Differential purposes

    To skim
    To memorize
    To follow directions
    To get specific information
    To learn how to do something
    To gain general knowledge of a subject
    To judge the validity of an argument or conclusion
    To judge the literary or technical merit of a passage
    To be entertained or inspired
        Rate and accuracy for any of the purposes; of concern in upper
        elementary, secondary, and college levels.

Differential subjects and materials

    Mathematics, Chemistry, etc.
    Advertising, Maps, Charts and graphs, etc.
    Fiction, Poetry, etc.
        Vocabulary, rate, and/or comprehension for any subject or material; of
        concern in upper elementary, secondary, and college levels.

Special reading tasks
        Rate and accuracy in use of table of contents, indexes, glossaries,
        directories, dictionaries, encyclopedias, footnotes, etc.; of concern
        from intermediate grades on.

Special disabilities
        Identity and extent of systematic errors, letters and common words not
        known, etc.; of concern in diagnosing cases of reading retardation.


PROCEDURES AND FORMS FOR MEASURING READING 

Of the possible methods of behavioral measurement, only product analysis 
is not readily adaptable to appraising a pupil’s status in reading. Observation 
is necessary for the mechanical dimensions and is valuable for the several 
special reading tasks of index and dictionary usage, etc. Often, the most 
valid approach to measuring comprehension is a free-response procedure — the 
pupil summarizes or interprets what he has read in writing or orally. Guided 
response instruments, however, are employed more than any other device 
and can provide reasonably valid and highly reliable measures of most dimen- 
sions of reading. 

Reading Tests. Various types of guided response items used in reading 
are illustrated in Figure 41. The fact that the samples are adapted from 



Rate

17. Several of the objects in the night sky are not
18. stars but planets.
19. One of the most interesting planets is the
20. ringed one called Saturn. Its rings are formed
21. by a cloud of small particles that circle the

(Pupil checks the number of the line he was reading when the examiner
called “stop.”)

Comprehension

Sentences

So intense were the flames from the factory that it was truly a conflagration.
(a) steel mill (b) big fire (c) celebration (d) false alarm
(Pupil indicates the best synonym for conflagration.)

Paragraphs

John, who weighs HO pounds, wants to fight in a class which has a top weight
of pounds. He intends to lose ten pounds and this will make him eligible.

(a) eligible (b) intends (c) ten (d) lose

(Pupil indicates which of the option words “spoils the meaning” of
the last sentence.)

Passages

(Pupil reads a given passage, usually timed, and answers a series
of multiple-choice questions about what he has read.)

Differential Purposes

General significance

11. Europe is the home of the white storks. People are delighted when they
fly north in the spring. Many believe that storks bring good luck to a village. A
family is proud when storks choose their roof on which to build a nest. Storks
do bring one kind of real luck: they eat anything that is thrown out. This helps
to keep the village clean and healthy.

Draw a line under the word that tells what many people believe
storks bring.

food nests spring riches luck

Figure 41. Types of guided response items used to measure various dimensions
of reading ability. (The four items for “differential purposes” are by permission from
Gates’ Basic Reading Test for Grade III (second half) through Grade VIII, Teachers’
College, Bureau of Publications. The balance are adapted from those appearing in
many tests.)



Predicting outcomes

7. No other creature of the woods is more timid than the deer, yet when
winters are particularly severe, deer have been known to visit farms and to
accept food. One February after a stretch of very cold weather we saw deer
tracks around our barn. That night we threw some hay outside and along with
it some kitchen scraps. In the morning we went outdoors.

Our horses were eating the scraps.
We saw two deer eating the hay.
We saw fox tracks in the snow.
We saw the deer pulling our sleigh.

(Pupil underlines one of the sentences.)

Understanding precise directions

22. These different shapes all have different names. The first is a circle; it has
no corners. The second is a square; it has four corners. Do you know the name
of the third? It is longer than a square. Make a cross in the center of the one
that has no corners.

Noting details

A book written nearly three hundred years ago tells a story about a tree which
grew in America. The story said that this tree cried when it was cut. It also said
that tears came from the cut which dried into a sweet sugar. Now we know that
this crying tree was the sugar maple, and that the sweet tears became maple
sugar when they dried.

An old book tells a story about a

flag tree book man

The story said that if this tree was cut it would

laugh cry sigh eat

The sweet tears from the tree become

salt powder sugar flour

(Pupil underlines one word for each of the questions.)

Figure 41. (Continued)




Vocabulary

Option (a) engine (b) sweet (c) hurry (d) choice

(Pupil selects synonym from options.)

Special Reading Tasks

Index

On one page a portion of an index. On the other, such multiple-choice questions
as:

On what page will you find a picture of an albatross?

(a) 37 (b) 99 (c) 138 (d) 22

Maps, tables, figures

On one page a map and a population chart for cities. On the other, such
multiple-choice questions as:

The city with the smallest population is nearest on the map to
(a) London (b) Berlin (c) Paris (d) Hamburg

Directory

On one page an office directory for a large building. On the other,
multiple-choice items associating names and office numbers, as:

David D. Jones

(a) 107 (b) 214 (c) 32 (d) 1011

Figure 41. (Continued)

Standardized tests suggests, as it should, that standardized tests are currently 
the mainstay of measurement in reading. Because the United States is sur- 
prisingly homogeneous so far as written language is concerned and because 
reading materials and the pace of reading instruction are much the same 
throughout the country, the growing dependence of teachers on standardized
reading tests seems entirely legitimate.* However, this should not deter any
teacher from devising and using his own guided response instruments, adapt- 
ing items like those in Figure 41 to his own purposes. 

The raw scores on standardized reading tests usually are converted into
grade equivalents and/or percentiles for a given grade. These, as you know,
are indexes of rank and also tell where a given pupil stands within the large
group of pupils to whom the test was given during the standardization process.
A given reading grade means that the pupil’s score is the same as the average
or mean score made by pupils in the given grade. For example, if Bill scores
63 points on a test and this corresponds to a reading grade of 6, this means

* A number of reading tests, together with descriptive remarks, are listed in
Appendix B along with other standardized tests.




that the average score of sixth-grade pupils, in the standardization population, 
was 63. If the reading grade is fractional, say 6.7, it means that the pupil’s 
reading ability supposedly corresponds to that of an average pupil who has 
been in the sixth grade for seven months. A percentile score on a reading 
test has significance only within a specified grade or grade segment and should 
not be used unless the group to which it refers is indicated or clearly under- 
stood. 
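
The reading of a fractional grade score can be made explicit in a short sketch. It is offered only as an illustration: the function name split_reading_grade is hypothetical, and the sketch assumes, as in the example above, that the single digit after the decimal point counts months spent in the grade.

```python
def split_reading_grade(reading_grade):
    """Split a reading grade such as 6.7 into (grade, months in grade).

    Assumption: the score carries one decimal place and the tenths digit
    stands for months completed within the grade.
    """
    tenths = round(reading_grade * 10)  # 6.7 becomes 67; rounding removes float residue
    grade, months = divmod(tenths, 10)
    return grade, months


grade, months = split_reading_grade(6.7)
print(f"Comparable to an average pupil {months} months into grade {grade}")
```

The conversion from a raw score to the reading grade itself depends on the norm table published with each test and is not attempted here.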

Scores on tests devised locally should be considered as indicative of rank 
only and may be converted into percentile scores (see page 157). If raw 
scores on the test have a restricted range or if only gross differentiation among 
pupils is necessary, the determination of quartile limits (see page 152) and 
thus the classification of pupils into upper, lower, and intermediate quarters
may be all that is necessary. Any free-response instrument used to assess reading
may yield classification or rank scores. Measures derived from observations
of reading must, as a rule, be verbal descriptions or classifications according
to some rating system. Scores derived from any nonstandardized procedure
should not be converted into reading grade equivalents.
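
For a teacher who keeps class scores in a list, the two conversions just mentioned might be sketched as follows. This is an illustration under stated assumptions, not the procedure given on the pages cited: the percentile rank uses one common convention (the percent of scores below, plus half of those tied), the quartile limits come from Python's statistics.quantiles with the "inclusive" method, and the function names and the ten raw scores are invented for the example.

```python
from statistics import quantiles


def percentile_rank(score, scores):
    """Percent of scores below `score`, counting half of the tied scores."""
    below = sum(s < score for s in scores)
    tied = sum(s == score for s in scores)
    return 100.0 * (below + 0.5 * tied) / len(scores)


def quarter_label(score, scores):
    """Classify a score as upper, lower, or intermediate by quartile limits."""
    q1, _median, q3 = quantiles(scores, n=4, method="inclusive")
    if score >= q3:
        return "upper"
    if score <= q1:
        return "lower"
    return "intermediate"


# Hypothetical raw scores on a locally devised reading test.
raw_scores = [12, 15, 15, 18, 20, 22, 25, 27, 30, 34]

for pupil_score in (15, 20, 34):
    print(pupil_score,
          percentile_rank(pupil_score, raw_scores),
          quarter_label(pupil_score, raw_scores))
```

Either result is a statement of rank within the class tested and nothing more; no grade equivalent is implied.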

It is becoming ever more common to express competence in reading as
a series of scores or as a graph showing the status of the pupil with respect to
the several dimensions that the test measures. Since reading is a complex of
many dimensions, this profile approach, as it is called, seems to be a far more
appropriate way to measure a pupil’s reading ability than is a single score. See
pages 125, 373 for illustrations of test profiles.

Cautions in Use of Standardized Tests. In using standardized tests certain
cautions must be observed. It is probable that they will have limited
validity for the extremes of norm groups. For example, a test designed for
grades 4, 5, 6, 7, and 8 is less valid for fourth and eighth graders than it is
for grades 5, 6, and 7. Stanley (24) found that the Nelson-Denny Reading
Test for High School and College was too difficult for the lower half of a
typical ninth grade. Secondly, nearly all standardized tests are timed tests.
In sections that purport to measure comprehension, the pupil is likely to try
to read as fast as he can and his rate may be greater than his usual rate.
Consequently, for many pupils it is probable that comprehension during acceleration
is being measured and not general comprehension (21). A third point
of caution is that the vocabulary, content, and format of the reading tasks of 
standardized tests may be somewhat unlike that of the material the pupils 
normally read. These differences will have a particularly depressing effect on 
readers with little self-confidence, those who show appreciable test anxiety and 
those whose reading is nearly all school-connected. 

Diagnosing Disabilities. The diagnosis of reading disabilities is an im- 
portant but highly complex undertaking. The use of standardized diagnostic 
instruments is time-consuming and requires a more extensive knowledge both 
of measurement and of reading than may be presumed for the average teacher. 
Descriptions of diagnostic procedures and instruments are to be found in 





textbooks on reading method and on remedial techniques. Diagnostic dimen- 
sions are essentially no different from those presented in the outline on pages 
228-230. They are, however, likely to be greater in number and more detailed 
and individualized than those measured by ordinary reading tests. 

A nonstandardized observational procedure, recommended by Dolch (4),
can yield significant, if imprecise, information about a poor reader’s difficulties
and requires no special training for its use. In this procedure, a pupil first reads
a passage aloud and the teacher supplies any words over which he hesitates.
This can disclose the common words with which he is unfamiliar. Second, he
reads aloud another passage, is given no help, and then is asked to relate what
he has read. This should afford some evidence of his general comprehension
of what he reads. A third passage then is read by the pupil. Each time he
hesitates over a word, he is asked to guess what it means and thus a measure
is provided of his ability to use context clues. Finally, in a fourth passage, he
attempts to pronounce and tell the meaning of unknown words in order to
determine what word attack methods he uses and how skillfully.

EVALUATIVE STANDARDS IN READING

The only precise objective standards by which a pupil’s reading ability
may be evaluated are the grade norms of standardized tests. These indicate
the usual range of reading test scores for given grades and, presumably, the
place of a pupil’s score in this range is indicative of how bad to fair to good
is his reading ability for that grade. In addition, a pupil’s reading grade placement
as derived from the norms of a standardized test may be compared with
his actual grade placement. If the reading grade is higher than the actual grade,
his reading ability may be judged favorably; if lower, adversely.
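
The comparison just described amounts to a simple rule, sketched below. The function name and the wording of the verdicts are hypothetical, and the treatment of a reading grade exactly equal to the actual grade is an assumption, since the text speaks only of the higher and lower cases.

```python
def judge_against_placement(reading_grade, actual_grade):
    """Compare a norm-derived reading grade with actual grade placement."""
    if reading_grade > actual_grade:
        return "favorable"
    if reading_grade < actual_grade:
        return "adverse"
    # Assumption: the equal case is not covered in the text.
    return "at grade placement"


# A pupil in grade 5.2 whose test yields a reading grade of 6.7.
print(judge_against_placement(6.7, 5.2))
```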

The relationship between given rates of reading and absolute indexes of 
comprehension, on the one hand, and degrees of success in school and occupa- 
tions on the other, might provide a nonrelative and independent standard. 
Unfortunately, although the relationship is known to be positive and appreciable,
it has been determined with no precision. So far as rate is concerned,
it is possible to use the word-per-minute equivalents of grade means and percentile
ranks in given grades as some sort of nonrelative standard. For example,
if on most standardized tests the mean w.p.m. of college freshmen is
250-300, a high school senior whose rate was in excess of this could be judged
to be able to read freshman reading assignments in the time allotted, while
one whose rate was less than 250 w.p.m. could be expected to spend undue
time on reading assignments.
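
The arithmetic of this rate standard is small enough to sketch. The figures 250 and 300 are those of the example above; the function names, the verdict wording, and the sample passage length and timing are hypothetical.

```python
def words_per_minute(words_read, seconds):
    """Reading rate in words per minute from a timed reading."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return words_read * 60.0 / seconds


def judge_rate(wpm, low=250.0, high=300.0):
    """Place a rate relative to the freshman mean range of the example."""
    if wpm > high:
        return "in excess of the freshman mean"
    if wpm < low:
        return "likely to spend undue time on assignments"
    return "within the freshman mean range"


# A senior reads a 1,400-word passage in five minutes.
rate = words_per_minute(1400, 300)
print(rate, judge_rate(rate))
```

Such a verdict concerns rate only and presumes adequate comprehension at the rate measured.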

It is felt that the matter of ability differentials should be given particular 
consideration in evaluating reading. Surely, immature primary pupils should 
not be given low marks in reading as a penalty for their immaturity. Yet, if 
strict competitive grading is practiced, this will result since mental maturity is 
the most important single determiner of progress in reading in the primary
grades. Consequently, it is recommended that descriptive statements, profiles 





of status in several reading dimensions, and teacher-parent conferences, one
or all, be used for reading in the elementary grades rather than a single letter 
or word mark. 


SPEECH 

While effective speaking is a sine qua non of education, it is perhaps the 
least measured of all educational phenomena. There are no standardized instruments
listed for its measurement in the most recent Mental Measurements
Yearbook. Dimensions are not exactly defined and there are no precise evalua- 
tive standards for the speech of school pupils. 

Dimensions of Speech 

In general, the systematic evaluation of status and improvement in speech 
is of concern only to teachers of public speaking and drama and to English 
teachers who include speech or dramatic expression in their classes. In addi- 
tion, speech correctionists and pathologists engage in the detection and diag- 
nosis of vocal disabilities. However, this sort of clinical measurement is beyond 
the scope of our text. The dimensions that speech teachers most commonly
assess are stated in the following outline.


Rate
    Words per minute.

Rhythm
    Smoothness or unevenness of talking; appropriateness of pace to subject;
    observance of meter in poetry.

Pitch
    Usually expressed qualitatively as high to medium to low, but can be
    measured in mean cycles-per-second.

Amplitude
    Usually expressed qualitatively as loud to medium to soft, but can be
    measured in mean decibels.

Tone
    Extent of resonance, nasality, harshness, etc.

Enunciation
    Correctness and clarity with which words are pronounced.

Diction
    Appropriateness and variety of oral vocabulary and usage.

Posture and movement
    Manner and appropriateness of standing, facial expressions, gestures, etc.

Several of these might be omitted and others would certainly be added 
according to the nature of the measurement task. In public speaking, dimen- 
sions relating to the content of speeches should be important and in dramatics, 
such factors as stage movements, ability to memorize, and “stage business.” 





Forms and Procedures in Measuring Speech 

Description or classification and ranking are the measurement forms applicable
to most speech dimensions. Rate, of course, may be expressed in
mean words-per-minute, a scale number.

Observation currently is the principal measuring procedure appropriate 
for school use. By the nature of the phenomenon instrumented measurement 
is likely to have questionable validity. Paper-and-pencil tests must be devoted
almost entirely to measuring a pupil’s verbal knowledge of correct speech,
not his performance of it; and, unfortunately, there is not a very high or even 
necessary relationship between the two. It's as though a rifle expert were to 
have his marksmanship judged, not by shooting at a target, but by answering 
questions about how one should shoot at a target. In observing a pupil’s 
speech, efficiency will be gained by using an appropriate observation schedule 
or rating scale. The construction and use of these as well as principles of 
observation are described in Chapter 4.

In addition to observing a pupil’s speech directly, it is possible to record
some of it and then listen to the recording. The advantage of this process is
that the tape or disk can be replayed and thus the examiner has more time
and can listen for specified dimensions without feeling that he is neglecting
others. Moreover, a pupil can listen to a recording of his own voice and be
directly apprised of his correct and incorrect utterances. Foreign language
teachers more and more are using tape recorders both for their evaluations
and for student self-evaluation.

Evaluative Standards in Speech 

Textbooks on speech and manuals of teaching method represent, in general,
the characteristics of effective speaking. Presumably, a teacher may compare
how a pupil talks with the book’s prescription and judge him accordingly.
However, textbooks usually present an ideal way of speech and not those
less acceptable, and it is necessary for the teacher to judge how close to the
ideal eleventh- or ninth-grade pupils should be expected to come. Moreover,
the statements in the book must be translated into visual and auditory sets
before the teacher can use them as standards in evaluating a pupil’s talk. In this
process, the teacher’s own style of speaking will have a necessary influence
and, thus, the standard actually used by the teacher may be extremely subjective.

A better sort of standard for speech is thought to consist of recordings
made by previous pupils that represent the range of competence commonly
found in the grade in question. These could be played to the class at appropriate
times and the teacher could indicate how he evaluated each specimen,
even giving them A’s, C’s, and F’s, if he wished. Pupils then might have a
more realistic notion of what they could accomplish and how this might be 
graded. 




As well as furnishing a fairly objective group standard, recordings also 
can provide each pupil with an individual standard. This would be a recording 
of each pupil’s speech made at the beginning of instruction. At the end or at 
several times during the instruction, another recording could be made using 
the same material. A comparison between the two would show how much 
progress had been made. 

Course or subject grades seldom are assigned to speech as such except in
public speaking and dramatics classes. Marks tend to be somewhat higher and
factors of effort and citizenship may have a greater bearing than in the three
R’s or in more academic secondary courses. For pertinent procedures of mark- 
ing and reporting, referral is made to Chapter 9, pages 207-212. 


COMPOSITION 

Composition is the term usually applied to free writing, where the pupil 
provides his own words, phrases, and sentences and does not merely copy from
a text. It may take the form of exposition, description, narration, or argument
and may be imaginative or factual. Elementary teachers usually become interested
in evaluating a pupil’s ability to compose in the middle grades, say IV
or V, and continue their interest through the balance of the elementary grades.
In secondary schools only English teachers and, sometimes, instructors in the
Social Studies, are much concerned with measuring skill in composition. In
our treatment we will concern ourselves only with the general aspects of com- 
position, omitting those unique to its special forms. Evaluative procedures for 
the latter — “plotting” in narrative writing, rhyme and meter in poetry, for 
example — are described in student texts and in instructional handbooks de- 
voted to the special forms. 

Dimensions of Composition 

Probably as many different lists of composition dimensions have been
devised as there are English teachers and English texts. But it is thought that
those presented in the following outline would be included in nearly all of
them. Moreover, the breakdown of the outline is consistent with the breakdown
usually observed by standardized tests. As in other language art areas,
a teacher seldom will need to measure all these dimensions in a given situation
and may wish to add others.

Some Dimensions of Composition

1. Syntax
    Appropriateness and accuracy with verb tenses and forms, plurals,
    pronoun cases, placement of modifiers, complete sentences, etc.

2. Capitalization
    Appropriateness and accuracy with proper nouns, titles, place names,
    first words in sentences and quotations, etc.

3. Punctuation

4. Format
    Indentation, margins, the special forms of letters, verse, outlines, etc.

5. Spelling

6. Vocabulary
    Accuracy of pupil’s definition, appropriateness of words used to ideas,
    extent of words used accurately, etc.

7. Sentence facility
    Assuming syntactical correctness, appropriateness of form to thought,
    parsimony of expression, placement of modifiers, phrases, and clauses,
    aptness of figures of speech, etc.

8. Paragraph facility
    Organization (topic sentence, chronology, opposition and resolution,
    etc.), use of parallel sentence forms, transition words and phrases,
    coherence and unity of ideas, etc.

9. Style
    The most ill-defined and variously defined of composition dimensions;
    includes such variables as variety of expressions, humor, subtlety or
    obviousness, interest, and “warmth”; before attempting to measure style,
    it must be given operational definition.


The first four dimensions are often called mechanics or grammar and they,
along with spelling and vocabulary, easily satisfy our conditions of measurability.
They are observable, discrete, well-defined, and agreement as to what
constitutes correctness for each of them is relatively easy to obtain. Not so
the last three. These, having to do with rhetoric or effectiveness as against
correctness, are observable but they lack the precise definition and agreement
among observers so necessary for accurate measurement. “Style,” of course,
is the least measurable of all; it is included only because teachers so frequently
try to evaluate it and because the attributes it symbolizes, however vague, are
of great importance.

Because the rhetorical dimensions are difficult to measure objectively and
accurately, it is especially important that they be defined in very specific
language. It is not sufficient to say, “I want to measure my pupils’ ability to
write effectively,” and then to devise a test or to start judging their written
products. It is necessary, as a prelude, to specify the details of these dimensions
and the way these details will be manifest in a product or in responses
to test questions.

Forms and Procedures for Measuring Composition 

Expression of a pupil’s status in the dimensions of composition largely is
restricted to description-classification and to rank symbols. A number of
different systems are used (letter marks, written comments, standard scores
and percentiles on standardized tests, grade placement, vocabulary age),
but none of them has the characteristics of a true scale.

PRODUCT ANALYSIS 

Composition is admirably suited to measurement by product analysis pro- 
cedures. The outlines, summaries, letters, themes, and stories that pupils 
produce so that they may learn to write are at the same time evidence of their
ability to write at any given moment. In fact, for practical purposes a pupil’s
writing ability is no more than the merit of what he has written, so there is
often no need to resort to tests for appraisal of this phenomenon.

In marking written products, a teacher may read the passage and assign
a single mark to the passage as a whole. Or, he may use some factor counting
or rating system and assign numbers or marks to the several dimensions of
the passage. The first method is quick, is consistent with the fact that a piece
of writing is a unit and not just a collection of parts, and probably is the
method of product appraisal most widely used by teachers. On the other
hand, “over-all impression” marking is less reliable and far more subjective
than a detailed analysis and is criticized by most researchers and measurement
specialists (27). The College Entrance Board “Essays” are marked differentially
for five dimensions and the Educational Testing Service scores its Foreign
Service essay examinations in similar detailed fashion. It is thought that
the increased reliability and diagnostic value of a detailed analysis more than
justify the extra time it requires. Techniques for this type of product appraisal
are explained in Chapter 5, pages 76-8 J . 

COMPOSITION TESTS

When it is desirable to measure achievement in composition by a test of 
some sort, the whole or part question asserts itself in another way. It is possible
to ask for a test response that constitutes an act of writing; it equally is
possible to ask for responses that are not in themselves the actions of writing
but bear an essential relationship to writing. An example of the first type of
test is the College Entrance Board General Composition Test, Form F, which
provides specific reading material on which an essay is to be based and then
requires that the student write an essay of several hundred words on this
material. The second test procedure is exemplified by true-false and multiple-
choice questions about rules of grammar and by items that ask for the meaning 
or spelling of words. 

You may recall the statement in Chapter 3, page 41, to the effect that 
maximum validity may be expected for a behavioral measuring instrument 
when it instigates the actual phenomenon subject to measurement. According
to this viewpoint, tests in which pupils write or somehow do the very things
persons do when they write, have the greater potential validity. Drawbacks
to this type of test are that it may be inefficient in use of pupil time, the reliability
of scoring may be low and, in a given testing period, only one or a few
of the varied forms and situations of writing may be sampled. 





The other approach — to elicit responses that are not themselves the usual 
acts of writing — must base its claim of validity on the premise that the test 
responses are directly related to how a pupil writes. While this is recognized
by psychometricians as a legitimate rationale, the relationship must be demonstrated
and the relationship often is too small to assure an adequate degree
of validity in the test. The advantages of such an indirect measuring procedure
lie in its speed of administration, in the reliability and even automation
of its scoring, and in the careful and comprehensive sampling of the various
forms and situations of writing that can be accomplished in a single testing
period.

Direct and Holistic Procedures. There seem to be about three possible
ways of directly testing a pupil’s composition ability:

1. Rating or analyzing a passage written on demand,

2. Comparing such a passage with a product scale, and

3. Editing a presented passage.

The first is well illustrated by the College Entrance Examination just cited.
The dimensions with which this standardized test is concerned and the manner
of rating them is shown in Figure 42, a reproduction of the grading sheet to
be used by the examiner for each essay. Ratings mean 1, superior, 2, high


GRADING SHEET

NO. OF PAPER ______     NO. OF READER ______

DATE ______     TIME BEGUN ______     TIME FINISHED ______

          MECHANICS   STYLE   ORGANIZATION   REASONING   CONTENT
1
Border
2
Border
3
Border
4

Place papers in categories 1 to 4 with an X; border may be marked
only A or V.

Figure 42. Grading Sheet from the College Entrance Examination Board, General
Composition Test, Form P, Educational Testing Service, Princeton. Reprinted by special
permission of Educational Testing Service.





adequate, 3, low adequate, 4, inadequate. As may be apparent, the central
weakness in this and any other essay rating procedure is the inability of 
examiners to agree exactly as to their working definitions of dimensions and 
rating categories and even to achieve complete self-agreement on reappraisal 
of the same paper. The aspects of the composition that may be measured 
most reliably are those that involve specific errors and can be counted: syntax, 
punctuation, and spelling. 
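
A teacher who wishes to see how far two examiners agree in placing the same papers in the four rating categories might tally their ratings as sketched below. Simple percent agreement is used here as one elementary index; it is an assumption of the sketch, not a method prescribed by the test publisher, and the function name and the ratings are invented.

```python
def percent_agreement(ratings_a, ratings_b):
    """Percent of papers given the identical category by both examiners."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return 100.0 * matches / len(ratings_a)


# Hypothetical category ratings (1 to 4) of eight essays by two examiners.
examiner_1 = [1, 2, 2, 3, 4, 2, 3, 1]
examiner_2 = [1, 2, 3, 3, 4, 2, 2, 1]

print(percent_agreement(examiner_1, examiner_2))
```

The same tally applied to one examiner's first and second appraisal of the same papers gives a rough index of self-agreement.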

The use of product scales, the second procedure, is infrequent, possibly
because so few reliable and well-standardized scales have been published.
Buros cites only Hudelson’s Typical Composition Ability Scale. According
to the reviewer, Osburn (20:316-317), it is the best product scale extant
but still is an imperfect instrument. The scale consists of a series of paragraphs
varying in excellence from that representative of the fourth grade to that of
the twelfth grade. A teacher finds the point of best fit and assigns a pupil the
number corresponding to that point much as is done in rating handwriting
specimens (see page 252). Other product scales are the Hillegas Composition
Scale (the forerunner of all such scales), the Willing Scale for Measuring
Written Composition, and the Lewis English Composition Scales.

The third holistic procedure, the copy-reading test, is illustrated by the
third sample item in Figure 43. Copy-reading tests can be highly reliable, and,
for this reason, are widely used in standardized English batteries. Their
validity, however, is suspect. The technique requires the recognition and
correction of existing errors. This, of course, is something a writer may do
in rereading what he has written but it is not what he does when he composes
originally. So the greater part of the validity of the approach must rest on an
assumption of direct and invariable relationship between skill in original
composition and skill in editing. Such test investigators as Travers (27) and
Griffin and Venable (13) think that this relationship has been exaggerated
and, for that reason, question claims of high validity for the copy-reading
procedure.

Indirect and Atomistic Procedures. As we have indicated, a second
basic way to test a pupil’s composition ability is to appraise separately each
of the many items of knowledge and skill that presumably agglomerate to
form his writing ability. The great majority of published tests use this ap-
proach. There are composition or English “batteries” which, in a single testing
period, test a sample of all the important knowledge and skills involved in
writing, and there are as well separate tests for each basic subdivision of skill
or knowledge. The usual components of batteries and the subjects of separate
tests are mechanics (syntax, punctuation, etc.), spelling, vocabulary, and
rhetorical effectiveness. Guided response items used for the measurement of
mechanics are illustrated in Figure 43 following. Spelling items are shown
in Figure 44, page 245, and those testing rhetoric in Figure 45, page 247.
Figure 41 in the section on reading also contains a sample vocabulary item.
The types of items in the figures, while derived from standardized tests, are just





[Figure reconstructed from a damaged scan. Bracketed words are conjectural; the
numbered underlines in item 5, the verb choices in item 11, and any deliberate
errors in the copy-reading passage are not fully recoverable.]

5. [Let] them stay a little longer if they want to; let's go home.

(Pupil selects the underlined part he thinks is wrong.)

11. There [verb choices not recoverable] the one on which my grandfather
sailed long ago.

(Pupil selects the correct verb form.)

John walked to the door with
his guest. “let me know when
20
you are in town again, colonel
21
Johnson,” he said. “if I am not
22 23

20 ( )   21 ( )   22 ( )   23 ( )

(Pupil writes c in the ( ) if he thinks a word should be capitalized,
s if he thinks it should start with a small letter.)

I seem to be one of those fortunate people who recognizes that its a
privilege to see the common place beauties which surround us. Why don't
everyone see them? Hasn't it occurred to

(Pupil copyreads the passage and edits it as he thinks necessary.)

I wonder what hes planning to do whispered
44 45 46
Frank to John I do not think he has any real injury
47 48
do you John thought it best to make no reply
49 50 51

(Pupil indicates what punctuation, if any, is needed above each of the
numbers.)

Figure 43. Examples of guided response items designed to test a pupil's knowledge
of the mechanics of composition. (All are from the Cooperative English Tests for High
School, published by Educational Testing Service, Princeton. Reprinted by special
permission of Educational Testing Service.)


as appropriate for nonstandardized instruments. It is interesting to note that
the items depart from the traditional true-false, multiple-choice types, thus
demonstrating the flexibility of guided response procedures.

Mechanics. The measurement of a pupil's knowledge of syntax, punc-
tuation, etc., as things apart from his actual writing is open to several criti-
cisms. Correct usage does not invariably follow knowledge of the rules and
elements of correct usage. Manuals and texts covering mechanics often differ
as to what is correct usage. The inclusion of items in tests often seems to be



based on frequency or importance in English texts rather than in public
writings. These, and other points of criticism, are supported by research, but
the measurement of mechanics by means of guided response instruments
persists. The analysis by Griffin and Venable of eleven standardized tests of
usage (13) is a representative critique of this aspect of compositional meas-
urement.

In our view, analysis of the mechanical errors in whatever products the
pupils write is a far more valid and sufficiently reliable means of measure-
ment for mechanics. If a test is desirable, passages can be written on topics
designed to elicit certain usages or pupils can be instructed to write sentences
and/or paragraphs containing the elements a teacher wishes to appraise. For
example, “Write a sentence containing direct address and another containing
indirect address.” “Compose three sentences about a boy, a shotgun, a dog,
and a rabbit. One should be a simple sentence, one a complex sentence, and
the third a compound sentence.” It is necessary to resort to standardized tests
and thus to guided response items for mechanics only when comparative
data is sought, i.e., the grade placement of a pupil’s knowledge of mechanics.

Spelling. A pupil's ability to spell may be judged in either of two ways. 
The writings he submits may be examined for misspelled words with fre-
quency of error considered to be an inverse index of his spelling ability. The
chief limitation of this method is that any pupil’s free-writing vocabulary 
may be unduly restricted to words that he can spell. Its primary advantage 
is that spelling is measured as it is ordinarily done. 
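
The first method, frequency of error in free writing, can be sketched as a simple count. This is an illustration under assumptions, not a procedure from this text: the function name is hypothetical, the reference list of correct spellings is assumed to be supplied by the teacher, and the rate is expressed per 100 running words so that samples of different lengths can be compared.

```python
import re


def misspelling_rate(sample, correct_spellings):
    """Return (misspellings per 100 running words, list of misspelled words)."""
    words = re.findall(r"[a-z']+", sample.lower())
    if not words:
        raise ValueError("the sample contains no words")
    misspelled = [w for w in words if w not in correct_spellings]
    return 100 * len(misspelled) / len(words), misspelled


# Hypothetical reference list and pupil sentence.
correct_spellings = {"the", "dog", "chased", "rabbit", "across", "field"}
sample = "The dog chased the rabit acros the field."

rate, misspelled = misspelling_rate(sample, correct_spellings)
print(rate, misspelled)  # 25.0 ['rabit', 'acros']
```

A low rate should be read with the limitation noted above in mind: a pupil who writes only words he can spell will show few errors whatever his general spelling ability.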

The second way is to ask pupils to spell a number of specified words.
This method has the advantages that attend use of a controlled list of words.
They can be made easy or difficult at will. They can be chosen to represent
given types of words. If the list is extensive and carefully chosen, it can be
considered a representative sample of all words. Moreover, pupils can be
compared with each other more fairly than when product analysis is used.
As its disadvantage, the spelling list method gauges the spelling of words in
isolation and under conditions that exist for the test only.

It would seem that both methods are necessary if spelling ability is to be 
measured efficiently. The first provides the more diagnostic information and
can be performed casually and continually. The second is essential for com- 
parative data and for any comprehensive appraisal of the words pupils can 
and can’t spell. 

The first task in devising a spelling test is to determine the words to be 
included. It is common practice in elementary grades today to select these 
from among the words pupils use in school and/or are exposed to in their
readers, subject texts, and collateral materials. If spelling is taught as a
specific subject and if spelling lists are used for the instruction, the words on
the lists are the ones to be included in the tests. 

The words to be used in any test should be judged in terms of appropriate 
difficulty and adequate sampling. Publishers of standardized spelling tests 





generally grade the difficulty of words they use for any grade level from those
most pupils can spell to those very few can spell. This is a good criterion for
a teacher to follow in devising his own tests. If the speller lists words in order
of predicted difficulty or according to the grade in which they should be
learned, it is relatively easy to make a selection.⁷ It is not expected nor neces-
sary in teacher-devised tests that the difficulty of words be arranged on a
precise statistical basis as is done in some standardized tests.
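
Although no precise statistical basis is required, a teacher who keeps results from earlier classes can grade words roughly in the manner described, from those most pupils can spell to those very few can spell. The sketch below is only an illustration under assumptions: the function name, the words, and the results are hypothetical, and difficulty is taken simply as the proportion of pupils who spelled each word correctly.

```python
def order_by_difficulty(results):
    """Order words from easiest to hardest.

    results maps each word to a list of True/False outcomes, one per pupil
    (True means the pupil spelled the word correctly).
    """
    graded = []
    for word, outcomes in results.items():
        if not outcomes:
            raise ValueError(f"no results recorded for {word!r}")
        graded.append((word, sum(outcomes) / len(outcomes)))
    # The highest proportion correct (the easiest word) comes first.
    return sorted(graded, key=lambda pair: pair[1], reverse=True)


# Hypothetical results for four pupils on three words.
results = {
    "occasion": [False, True, False, False],
    "cat": [True, True, True, True],
    "friend": [True, True, False, True],
}

print(order_by_difficulty(results))
# [('cat', 1.0), ('friend', 0.75), ('occasion', 0.25)]
```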

How many words should be included on a given test cannot be stated
generally. The more words, the more reliable are likely to be the results.
Semester or year tests ordinarily should be longer since more attention may
be given to scores on them. (See page 103 for a general discussion of the
number of test items necessary for adequate sampling.)

Spelling tests may be administered orally or in writing. If the administra-
tion is oral, words should be pronounced, used in a sentence, and repro-
nounced. Types of items used in written spelling tests are shown in Figure 44.

[Figure reconstructed from a damaged scan. The misspelled options in item 2 and
the deliberate misspellings in the passage of item 3 are not recoverable.]

1. Check the word which is misspelled.

liability    ocasion    deficit    none of these

2. The correct spelling of a word meaning a military officer is

(a) through (d) [four variant spellings of "lieutenant"]
(e) none of these

3. Underline the misspelled words in this passage.

There are many things to observe as you ride down town on the bus.
Besides people and animals, there are stores, cars, and construction jobs.

4. The examiner pronounces words for pupils to spell.

Figure 44. Examples of guided response items and techniques for measuring spelling
ability.

The selection of a correct spelling among several incorrect ones for a given
word, Sample 2, is perhaps the type most frequently used in published tests.
Standardized spelling tests are available, usually as parts of batteries.
Whether to use a standardized test or a locally devised one depends largely
upon the purpose of measurement. If this is to evaluate achievement during
a given period of instruction, a test devised from the reading material and
spelling words taught would seem to be appropriate. If the purpose is to
compare a pupil or a class with age or grade norms, then the standardized
instrument is mandatory.

⁷ We are talking here about measuring general achievement in spelling. If the pur-
pose is merely to see that pupils have learned the words they were directed to study,
these are, of course, the items for the test.




It sometimes is desirable to appraise not only how well or how poorly a 
pupil spells but why he spells poorly and what words he has trouble with. 
Among the diagnostic tests available to provide this information is the Gates- 
Russell Spelling Diagnosis Tests (Appendix B., page 470). This instrument 
permits analysis of a pupil’s phonetic and sight disabilities, his method of 
studying spelling words, his auditory discrimination, etc. Examination of a 
pupil’s misspellings in any written work he submits can also be a diagnostic 
procedure. Apparent in his writings will be the words he chronically misspells, 
his transpositions and reversals, his phonetic confusions, etc.

Vocabulary. As with spelling, a pupil’s vocabulary may be assessed by
examining his written products, to see what words he uses correctly and in-
correctly, how varied are these words, and to what subjects and abstraction
levels they belong. However, this is a seldom-used procedure. When separate
measurement is given to vocabulary, it largely is through the device of a guided
response test and usually this is a standardized one. Nearly every English and 
general achievement battery contains a vocabulary section and both group 
and individual intelligence tests measure vocabulary as a significant com- 
ponent of intelligence. 

In interpreting the results of vocabulary tests it is well to keep certain
things in mind. Words are included because of their relative frequency of
occurrence in adult writings or in school use, much as with spelling. In the
tests, words usually are graded as to difficulty on the basis of what percent-
age of the standardization population knew them. Their difficulty in this case
has nothing to do with the complexity or abstractness of the idea they sym-
bolize. The words in standardized tests have a strong middle-class and literary
bias. Finally, the tests, as a rule, do not measure the comprehensiveness or
precision of a pupil’s definition but only his knowledge or lack of knowledge
of a given minimum and rudimentary definition.

Rhetorical Effectiveness. The measurement of such dimensions as sen-
tence and paragraph facility and style seldom is undertaken separately and
apart from actual writing. When guided response procedures are used, they
usually require discrimination between good and bad phraseology and/or
knowledge of principles of good writing. Two illustrative items are presented
in Figure 45.

One of the criticisms of these procedures is much the same as the one
voiced for guided response tests of writing mechanics: they test skill at edit-
ing, not writing, and the relationship between ability to edit and ability to
write is known to be imperfect. Another point of criticism is equally telling. 
What constitutes good and bad phrasing is sometimes debatable and, hence, 
what are the correct answers may be indeterminate. 

The effectiveness of any piece of writing ultimately depends upon whether 
a reader considers it effective, whether he responds to it as the writer intended. 
Only a limited number of “rules” may be derived for such “effective” writing. 
Moreover, any student of language knows that these rules change from region 





13-1. The creek overflowed its banks due to heavy rainfall.
13-2. The creek overflowed its banks, heavy rainfall being the cause.
13-3. The overflowing of its banks by the creek was owing to heavy rainfall.
13-4. The creek overflowed its banks because of heavy rainfall.

(Pupil selects the best sentence of the four.)


     Column 1                               Column 2

A-1  Housing seems to be as much       A-2  Housing, a problem in the
     of a problem in the bird world         bird world, is also quite a
     as it is among human beings.           serious problem among hu-
                                            man beings.

B-1  A one-family bird house was       B-2  The other day two pairs of
     inspected by two pairs of              bluebirds inspected a one-
     bluebirds in a suburban gar-           family bird house in a sub-
     den the other day.                     urban garden.

C-1  Both pairs liked it. Both de-     C-2  Both pairs liked it and they
     cided to move in. But did they         decided to move in, and they
     draw straws for priority?              did not draw straws for priority,
     They did not!                          no indeed.

D-1  The males haggled a bit, got      D-2  The males haggled a bit. They
     nowhere, then flew at each             got nowhere. They flew at
     other, beak and claw, and              each other, beak and claw.
     fought it out.                         The matter was then fought
                                            out.

E-1  The winner of the battle and      E-2  The winner of the battle and
     his mate moving into the bird          his mate moved into the bird
     house, quite happy in their            house and are now quite
     new home.                              happy in their new home.

(Pupil reads each of the columns and then answers questions which require
that he select the better phrased passages and know why they are better.)

A. Section A is better expressed in
   A-1 Column 1.
   A-2 Column 2. . . . A ( )

a. The inferior version of Section A is poor because
   a-1 emphasis is placed on the wrong part of the idea.
   a-2 the sentence is incomplete.
   a-3 two sentences are punctuated as if they were a single sentence.
   a-4 it is grammatically incorrect. . . . a ( )

Etc.


Figure 45. Two examples of guided response items keyed to rhetorical skill. (Both
are from Cooperative English Tests for High Schools, Educational Testing Service,
Princeton. Reprinted by special permission of Educational Testing Service.)





to region, from culture segment to culture segment, and from year to year.
For these reasons there are few standardized English tests that attempt to
measure rhetorical effectiveness through guided response items. The Co-
operative English Tests (Appendix B, page 470) and some forms of the
College Entrance Examination Board English Tests are perhaps the more
notable examples of such tests.

One alternative to guided response measurement is analysis of products 
or test essays in terms of rhetorical dimensions. This is a phase of the whole 
method of measuring composition previously described (see page 240) and 
is the procedure most used by English teachers. A second alternative is avail- 
able for certain phases of the rhetorical dimensions. It consists of providing 
pupils with given words, facts, or ideas and asking for their expression in 
certain forms. For example:

(1) Write a sentence containing a figure of speech that would describe a
teacher scolding a pupil for misbehavior.

(2) Rewrite these three sentences to give them parallel construction and put
them in proper sequence to make the best sense:

The rivers on the plains were swollen by water from the creeks. Down from
the storm clouds in the mountains came water to fill the creeks. A tide of water
from the rivers surged out into the sea.

The virtue of this technique is that it requires writing, not just editing or
recognition of propriety, and it requires the same sort of expression from all
pupils, thus permitting some comparison among them. Such items may not,
though, be scored as objectively and reliably as strictly guided response items
and they may be applied only to recognized rhetorical conventions.

Evaluative Standards and Practices in Composition 

As in the majority of school subjects, teachers use largely subjective stand- 
ards for judging the value of their pupils' achievement in composition. These 
subjective standards, what teachers feel is good, fair, or bad work at any
grade, seem to derive from two objective sources. One consists of the English
texts, the dictionaries and, above all, the published writings that the teachers
have read and admired. The second is the writing typical of pupils in the
grade levels in question. As suggested in Figure 46, the influence of classics
and style manuals seems to set one axis of the standard and the influence
of all the pupil themes and letters ever read seems to set the other axis. Sub-
jectivity enters because the two influences or axes are never the same for two
teachers and because the relation of A's, C's, and F's to the two axes is par-
tially a function of the temper and training of the given teacher.

The use of such two-dimensional standards for evaluating composition 
seems to be valid. The published writing accepted as exemplary by educated 
adults is, by definition, the basic standard for adult writing and should affect 
school standards. On the other hand, young pupils should not be expected to 






Figure 46. How evaluative standards seem to operate for English composition.


write as well as older ones, so grade differentials in writing should affect
school standards as well. It is thought that adherence to the following rules
will do much to reduce the subjectivity and capriciousness of composition
standards.

1. Write down as explicitly as possible the type of writing and the degree
of correctness that will be assigned each mark at each grade level.

2. Collect many samples of pupil composition awarded different marks
according to the standard.

3. If standards are to be different for different ability levels, make this
difference explicit in the standards.

4. Within a school and within a school system, the statement of standards
and collection of sample pupil products should be a group undertaking. Inso-
far as possible, the same standard should be used by all teachers handling
comparable pupils in comparable classes.

Standards for spelling involve a special consideration. As we have ob-
served, the words children must learn to spell in elementary grades usually
are controlled by a speller or by a list of some sort. With such an objective and
exact instructional objective, the standard for evaluating spelling achievement 
can also be objective and exact. Correct spelling of the words accumulated 
up to a given grade might be set as the equivalent of a given mark, with 




greater and lesser amounts being assigned other given marks. The proper
words/marks equivalence would need to be determined by a teacher ac- 
cording to his experience and the characteristics of his pupils. 
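
A words/marks equivalence of this kind amounts to a table of cutoffs. The sketch below is illustrative only: the function name and the cutoff proportions are hypothetical and, as stated above, the actual equivalence would have to be set by the teacher from his experience and the characteristics of his pupils.

```python
def spelling_mark(words_correct, words_required, cutoffs):
    """Return the mark whose minimum proportion the pupil has reached.

    cutoffs is a list of (minimum proportion, mark) pairs ordered from the
    highest minimum to the lowest, ending with a minimum of 0.0.
    """
    if words_required <= 0:
        raise ValueError("words_required must be positive")
    proportion = words_correct / words_required
    for minimum, mark in cutoffs:
        if proportion >= minimum:
            return mark
    raise ValueError("cutoffs must end with a minimum of 0.0")


# Hypothetical equivalence chosen by a teacher for a 200-word cumulative list.
cutoffs = [(0.95, "A"), (0.85, "B"), (0.70, "C"), (0.50, "D"), (0.0, "F")]

print(spelling_mark(180, 200, cutoffs))  # B
```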

Composition seldom is marked as a separate school subject. On elemen- 
tary report cards it usually is a component of “language,” “language arts,” 
or “English.” In secondary grades it contributes to the grade in English
classes. Only a very few secondary report cards provide for its separate
evaluation. When conferences or anecdotal reports are used instead of the
traditional A, B, C, D, F report card, specific attention can be and usually is 
given to composition. 

Not to evaluate composition as a separate phenomenon obviously violates 
a basic principle of efficient marking: the meaning of evaluative symbols
should be clear to pupils and parents. When a single letter or word refers to 
more than one important phase of a subject, as to speaking, understanding 
of literature, composition, handwriting, etc., there can be no clear commu- 
nication. Moreover, just to evaluate composition by itself may not be suffi- 
ciently informative to pupils. We have seen that composition has numerous 
dimensions. Unless a pupil achieves to the same extent exactly in all of them, 
a single mark or even an evaluative phrase is only an average. The pupil is 
left to surmise how weak, relatively, are his mechanics and how strong, rela-
tively, his vocabulary and style, in the teacher's view.
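
The point that a single mark is only an average can be shown with two hypothetical pupils. In this sketch, which is an illustration and not a procedure from this text, each dimension is assumed to be rated on a four-point scale with 4 as the best rating; the dimension names follow those discussed above, and the function name and ratings are invented.

```python
def summarize_profile(profile):
    """Return (average rating, weakest dimensions, strongest dimensions)."""
    average = sum(profile.values()) / len(profile)
    low, high = min(profile.values()), max(profile.values())
    weakest = [name for name, rating in profile.items() if rating == low]
    strongest = [name for name, rating in profile.items() if rating == high]
    return average, weakest, strongest


# Hypothetical ratings on a four-point scale (4 = best).
pupil_a = {"mechanics": 2, "spelling": 3, "vocabulary": 4, "style": 3}
pupil_b = {"mechanics": 3, "spelling": 3, "vocabulary": 3, "style": 3}

print(summarize_profile(pupil_a))  # (3.0, ['mechanics'], ['vocabulary'])
print(summarize_profile(pupil_b)[0])  # 3.0
```

Both pupils would receive the same single mark, yet only the separate ratings show that the first is weak in mechanics and strong in vocabulary.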

So, it is recommended that composition be evaluated as a separate sub-
ject and that it be evaluated with respect to each of its important dimensions.
This may be done by conversation, by written statement, by letter or number
marks for each dimension, by a profile graph, or by a combination of all.
The use of profiles has been discussed in connection with measuring proce-
dures (see pages 125-126), but the profile is equally applicable to reporting
evaluations. A profile report card for English, including composition and
its dimensions as separate entities, is shown in Figure 49.

HANDWRITING 

In this day of the typewriter, excellence of handwriting is rapidly becom- 
ing a lost art. Even the smallest businessman uses the typewriter for billing 
and business correspondence. The official records of court proceedings, land 
ownership, marriage and death, now are found neatly typed rather than 
elegantly inscribed as formerly. The use of handwriting still persists, of 
course, for notes, for personal correspondence, and above all, for the themes, 
outlines, tests, and homework of elementary and secondary schools. In the 
schools, consequently, ability to write legibly and attractively is an important 
asset still, but in the adult world it seems no longer to be.⁸

As handwriting has diminished in importance, instruction in handwriting 

⁸ There seems to be no conclusive research on this point but most educators seem
to think that handwriting is socially less important today than yesterday. 





has received less stress and evaluation of achievement in handwriting has
languished even more. In the 1910’s and 1920’s Ayres, Thorndike, Freeman,
and others with their “normative” scales and diagnostic procedures seemed to
offer elementary teachers scientific ways of assessing progress in handwriting.
As teachers taught pupils the correct slant, the full arm movement, and had
them practice

[illustration: handwriting drill strokes]

so did they compare the pupils’ writings with the scales, give the pupils timed
tests, and try to analyze their difficulties. In the 1950’s, however, drills in
O’s, l’s, and m’s have become far less extensive. Manuscript writing largely
has replaced cursive writing in the primary grades, and the Ayres, Thorndike,
and Freeman scales are used much less frequently though they still are the
standard instruments for measurement of handwriting. That no important
new standardized instruments have been devised for handwriting since 1933
is a telling indication of declining interest in its measurement.

Handwriting Dimensions 

Legibility, quality, and speed are the dimensions of handwriting most
frequently “measured.” Obviously, the first two dimensions are a function of
the observer as well as the writer. Attempts have been made to objectify
legibility by analyzing it into such components as letter formation, spacing,
slant, letter height, lightness-heaviness of stroke, and regularity. To make
the dimension of quality (beauty, attractiveness, etc.) less subjective has been
far more difficult and the effort has met with little success.

Means and Forms of Measurement for Handwriting 

The most reliable way of measuring handwriting, through use of a prod-
uct scale,⁹ begs the question of dimensions. A sample of a pupil’s writing is
compared with scale specimens until one most like the pupil's is found. The
number or other index of value belonging to this scale specimen is the measure

⁹ While these instruments are called scales and while their designers have attempted
to give them the attributes of true scales, it is thought best not to treat scores derived
from them as scale numbers but simply as rank numbers. The different grades of hand-
writing in a scale necessarily represent rank order among the specimens considered for
the scale. It seems to be more reasonable to consider merely that the “scale” is repre-
sentative of rank order for a given group of pupils than to consider that it is representa-
tive of regular gradations of skill in handwriting.





Figure 47. Samples from the Ayres Handwriting Scale. By permission of the Pub-
lisher, Cooperative Test Division of Educational Testing Service, Princeton, New Jersey.


of the pupil’s handwriting. In the Ayres scale (see Figure 47 and Appendix
B, p. 472) all specimens are from the same paragraph of the Gettysburg
Address and pupils must write this paragraph under specified conditions. Most
of the existing scales, like the Ayres, have a single series of samples that cover
all grade levels and are progressively more legible and attractive.

Although a product scale provides a fairly reliable comparative index of
the adequacy of a pupil's handwriting, it does not provide the detailed in-
formation that may be of use in instruction. For this, it is necessary to ob-
serve a pupil as he writes and to analyze his written products in terms of arm
and hand movement, letter formation, spacing, etc.

Speed of handwriting is easily measured by timed observation of pupils 
as they write various assignments or by timed tests. In the latter procedure 
it is well to use a familiar passage so that speed of thinking or of reading may
not be confused with speed of writing per se. According to Freeman a two-
minute period is optimum for such a test (29). It is important that pupils
start and stop on time, that the test passage be similar to the material they 
usually write, and that it contain all letters of the alphabet in about the same 
proportion as they normally occur. 
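
A timed test of this kind reduces to a count of letters divided by the time allowed. The sketch below is an illustration under assumptions rather than Freeman's procedure: the function names are hypothetical, speed is expressed in letters per minute, and the second function checks only whether every letter of the alphabet is present in a test passage, not whether the letters occur in normal proportion.

```python
import string


def letters_per_minute(written_text, minutes=2.0):
    """Return the number of letters written per minute of the timed period."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    letters = sum(1 for ch in written_text if ch.isalpha())
    return letters / minutes


def missing_letters(passage):
    """Return the letters of the alphabet that never appear in the passage."""
    present = {ch for ch in passage.lower() if ch in string.ascii_lowercase}
    return sorted(set(string.ascii_lowercase) - present)


# Hypothetical text produced by a pupil in a two-minute period.
print(letters_per_minute("The quick brown fox", minutes=2.0))  # 8.0

# A passage that happens to contain every letter leaves nothing missing.
print(missing_letters("The quick brown fox jumps over the lazy dog"))  # []
```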

Evaluative Standards in Handwriting 

The standards usually applied to an evaluation of handwriting are the 
norms of handwriting scales, the letter and word forms cited as ideal in hand- 
writing manuals and, as always, the standards in the minds of teachers. The 






first type of standard permits evaluation of a pupil's status relative to his 
peers and to usual grade expectations. Keeping in mind a pupil's ability and
his opportunity to learn, his progress in handwriting could be judged good to 
poor as he exceeded or fell below the appropriate norm. There is at least one 
published scale, the Conrad Manuscript Writing Standards,¹⁰ which can be
used as an evaluative standard for manuscript writing. 

The letters and words in manuals and on wall charts used as models by
pupils as they learn to write afford a different sort of standard. As examples
of what is correct, they may be used to tell a pupil how well he is forming
letters and words and where he needs to improve; but they should not be used
as a basis for marking. As compared with the models, the handwriting of all 
third graders is a failure and not one eighth-grade child in a thousand deserves 
an A. Yet both third and eighth graders might be learning to write with great 
efficiency and thus deserve high marks. 

What any teacher thinks constitutes poor and good handwriting at any 
grade is a vague standard and is easily affected by what the teacher thinks of
the pupil as a whole. Since objective standards exist in product scales and in
the specimens of manuals and wall charts, dependence on a subjective stand-
ard is thought to be unnecessary and unwarranted.

In lower grade reporting, handwriting may be given a separate mark or 
comment. In upper grades and in secondary English classes it seldom is 
marked separately, but it affects language arts and English marks to an un-
known extent. For the most valid evaluation, it should, of course, receive its
own specific evaluation. 


LITERATURE 

So far, we have dealt with the skills aspect of language arts. Instruction
and measurement largely has been simple and direct. Now, in literature, the
other phase of language arts, we encounter phenomena whose measurement
is often complex and always indirect. Instruction relative to literature begins
as soon as pupils can read for pleasure and for vicarious experience. In most
schools informal and incidental attention begins in the fourth or fifth grade;
by the seventh grade there are required readings and anthologies may be
used, and from the eighth or ninth grade on, study of literature may consume
half or more of the time allotted to English.

Dimensions of Literary Achievement

It is certain that all English teachers wish for pupils to learn the names
of some books, authors, and characters, to know the difference between
poems and short stories, and to recognize what a plot is. Hence, the general
dimension of knowledge is an obvious one. Invariably, though, in writing and

¹⁰ New York: Bureau of Publications, Teachers College, Columbia University.





in discussion, English teachers insist that this is not all they wish to teach
nor even the most important thing. They talk about literary appreciation,
about improved tastes in reading, and about gaining cultural values from
reading as another body of English objectives. So it is necessary to posit an-
other general dimension for literature, literary appreciation. This term is
used not because it is a clear one but because it is the term most commonly
applied to these objectives. The common quality in them seems to be that of
feeling, so the term “literary attitudes” might be more appropriate.

Particular courses of study in literature involve varied specific dimensions
of knowledge and appreciation. When a teacher wishes to measure his pupils
in relation to literature he should focus on specific dimensions most appro-
priate to his instruction. In the following outline are the types of dimensions
with which teachers of literature at all grade levels are likely to be concerned.
The outline may be used as a general source or simply as an example of the
way dimensions may be stated.

Some Dimensions ot I itcriiv knowledge and Appicciation 

Knovihdqt 

1 What and how m my things lu known about littrai\ woiks type luthoi, 
title, characters settings, plot, style inhcicnt philosophy lesson oi moral rela- 
tions to other works, publication cvenls etc 

2 What and how many things aie known ibt)ut litciaiy history^ lulhors titles 
chronology, schools inlluenccs, sc’ciocconomic rcl i(u>nships development of 
forms icciirrent themes the yicws ot ciitics etc 

What and how many things arc known ibout liteury lorms and techniques 
anecdote, short story, novel dram i poem essay, sketch, types ol plot of rhyme 
of meter of point of view of ch iiaetenz ition of plot development of nirrative 
and descriptive styles, ol figure of speech literary modes tiagcdy comedy, satire, 
melodrama, epic faicc fable etc 

Appreciation

1. Preferences in reading, as to titles, authors, forms, subject matter, etc.

2. Purposes in reading: for simple diversion, for escape, for vicarious thrills,
for emotional release, for ennoblement, for aesthetic experience, etc.

3. Transferred feelings during reading: extent to which the emotional content
of the reading (if any) is felt vicariously by the reader: mirth, love, rage,
sorrow, etc.

4. Transferred values during and after reading: the extent to which the values
expressed or inherent in a work are incorporated by the reader: ethical norms
(for example, the Golden Rule), political views (for example, democracy is the
superior form of government), social standards (for example, youth should
respect and obey their elders), philosophic dogma (for example, man is a puppet
of blind physical forces), etc.

5. Aesthetic discrimination: extent to which the reader derives pleasure from
the form, structure, technique, language, etc., of a work as apart from its
content and the extent to which he discriminates among different works according
to their “aesthetic quality.” For example, a tenth-grade boy is apt to like
adventure stories. His aesthetic discrimination would govern to some extent his
preferences among these titles: Tarzan and the Apes, Tom Sawyer, The Broad
Highway, Beau Geste, 20,000 Leagues Under the Sea, Tom Jones, Scaramouche,
Kidnapped, and Captain Midnight Comic Books.

The dimensions of literary knowledge and appreciation are noteworthy
in several respects. They are essentially arbitrary. They nearly all represent
covert behaviors or states of mind and feeling. They tend to lap over into one
another and have vague and unbounded definitions. Because of these
attributes, careful attention must be given to a statement of dimensions before
their measurement is undertaken.

Because they are arbitrary, it may not be assumed that any two efforts
to measure achievement in literature are related. When a given teacher or a
given standardized test essays to measure this phenomenon, what is measured
is what the measuring procedures happen to deal with and not some
independent, always-the-same entity, “literary knowledge and appreciation.”
Hence, it is mandatory to say what is meant by literary achievement in the
given instance and to claim only to measure that.

Because the dimensions are covert behaviors, they are not measurable as
such and must be “translated” into the overt behaviors directly related to
them. This means, for example, that pupil actions must be found which show
that a pupil knows that “an epic is a long poem or song about a legendary
national hero” and show that he has incorporated “a sense of personal
integrity and courage” from reading Invictus. An action related to the first
item might be to write down the essential features of an epic when asked to do
so. An action demonstrative of the second item would be more difficult to
determine but might be such a thing as a pupil’s “referring to Invictus as he
persists in an unpopular stand with his classmates.”

Finally, because dimensions of literary achievement tend to be vague, it
is essential that they be defined as precisely as possible before measurement
is undertaken. If, for instance, one dimension is to be aesthetic
discrimination, the factors of a work that constitute its aesthetic aspect must
be stated: unity of plot and action, variety of phrasing, appropriateness and
novelty of figures of speech, naturalness of meter, etc.

Forms and Procedures of Measurement in Literature 

Description-classification and ranking are the forms of measurement
symbols applicable to literary knowledge and appreciation. Scale numbers are
precluded because no test or rating “scale” pertinent to literature has yet been
devised that has the attributes of a true scale. Since the dimensions of literary
knowledge and of literary appreciation are covert behaviors, their
measurement must be indirect and all procedures of measurement except product
analysis seem to be applicable.





MEASURING LITERARY KNOWLEDGE 

Guided response items useful for measuring literary knowledge are illustrated
in Figure 48. The several standardized instruments available for this
task (see Appendix B, page 470) are useful in ascertaining how much of a
given body of information given pupils possess. It is necessary for English
teachers to devise their own tests to measure what and how much pupils are
learning in a given instructional situation. However, self-devised tests may
use the same type of items as the published tests.

29. The character in Ivanhoe who spent several days in a coffin, supposedly
dead, was

29-1 Cedric
29-2 Isaac of York
29-3 Brian de Bois-Guilbert
29-4 Gurth
29-5 Athelstane                                                   29 ( )

Literary knowledge is amenable to true-false, matching, and ordering
items as well.

10. Which one of the following lines changes its meter in the middle of
the line?

10-1 “Take her up tenderly, lift her with care.”
10-2 “Lovely and airy the view from the hill.”
10-3 “My love, my love, my love, why have you left me alone?”
10-4 “I fled him down the nights and down the days.”
10-5 “For these red lips, with all their mournful pride.”
                                                                  10 ( )

A short passage is printed from a work of fiction, an essay, or a poem,
and questions are asked such as the following.

17. This passage is notable chiefly for its

17-1 beautiful language
17-2 colorful descriptions
17-3 clever dialogue
17-4 continuous action
17-5 suspense                                                     17 ( )

19. If we were to draw a line to indicate the rise and fall of interest in
this passage, it would look approximately like which one of the following?

19-1 to 19-5 [five line drawings, not reproduced]

67. The writer’s mood appears to be one principally of

67-1 optimism
67-2 doubt
67-3 anger
67-4 sadness
67-5 patience

Figure 48. Examples of guided response items for measuring literary knowledge.
(All are from Cooperative English Tests for High Schools, published by Educational
Testing Service, Princeton. Reprinted by special permission of Educational Testing
Service.)

In addition to guided response testing, literary knowledge may be assessed
through free-response items. They are more efficient for knowledge of works
and forms than for knowledge of literary history. The latter may be handled
so validly through guided response items that there seems to be no point in
using the less reliable and more time-consuming free-response approach. In
the case of works and forms, however, such essay questions as “Compare the
verse of Sandburg with that of Whitman” and “Distinguish between poetry
and doggerel” are likely to provide much better evidence of what the pupils
know than a series of multiple-choice or true-false questions.

Perhaps the most efficient single procedure for measuring how well pupils
comprehend both the substance and the technique of a particular work is to
have them read it under controlled conditions and then to answer either
guided or free-response questions relative to it. This device, illustrated in
Figure 48, is frequently used in standardized tests with summaries being
substituted for lengthy works. If a series of such tests are given which include
passages embodying many different literary forms and techniques, pupils’
knowledge of forms and techniques in general may be assessed. One of the
best tests produced during the Progressive Education Association’s
evaluation of experimental high school programs in its “8-Year Study” (22) was
based entirely on O. Henry’s story, A Municipal Report.

MEASURING LITERARY APPRECIATION

Measurement of the other basic dimension of literary achievement,
appreciation, is a far more complicated process and has been notably
unsuccessful. The two published tests that seem to have the greatest promise
have no norms and provide no data as to reliability and validity. These also
were products of the “8-Year Study’s” evaluation staff and now are available
through the Educational Testing Service. One, a Check List of Novels, asks
students to indicate whether or not they have read a series of novels and
whether they liked them or not. From the pattern of any student’s answers,
his literary habits and tastes presumably will be revealed. The other, Inventory
of Satisfactions Found in Reading Fiction, requires that students confirm or
reject a series of statements about what readers get out of reading or what
they dislike. Again, the pattern of a pupil’s responses is to be analyzed, this
time with a view to determining why he reads fiction or what significance it
has for him.

Literary Preferences. Both of these tests and any other direct approach
to measuring literary preferences are in danger of the “school solution” error.
In many areas, certain attitudes receive more social acceptance than others.
Pupils long have learned what the accepted attitudes are and when asked by
teachers for their attitudes they usually assert the right ones. They have
learned that this makes teachers happy and that it keeps the pupil out of
trouble. So pupils are a little more inclined to say they like the classics than
they actually do and to say that their free reading motives are aesthetic or
moral, whereas actually they may be to seek thrills or to escape.

Observation and free-response procedures are useful in measuring literary
appreciation. Reading tastes can be determined by watching pupils in the
library when they have time for free reading; by asking pupils to list the
books, stories, and articles they read over a period of time; and by asking
such questions for free response as, “What is the best book you’ve read and
why? What is the worst book you’ve read and why?”

Significance of Reading. The significance of reading for a given child
or youth is one of the most difficult of all factors to determine. Simply asking
him in a test “Why do you like to read?” is certain to produce answers heavily
weighted in the direction of how he thinks you want him to respond. If there
is time, a carefully directed discussion with a pupil about his reading can add
to and verify evidence from a test. Of course, what a pupil reads is in itself
indicative of why he reads, since given categories of literature and titles have
known psychological appeals.

Transferred Feelings and Values. Two of the most important appreciation
dimensions, transfer of feelings and transfer of values, are least
amenable to measurement by tests. The test situation is likely to produce
responses that reflect stereotyped viewpoints as well as what the pupil actually
derives from his reading. Moreover, since feelings and values often lack the
unity and uniqueness necessary for clear remembering, a recollection of “how I
felt as I read this” is apt to be greatly distorted no matter how conscientious
the pupil is.

For appraising these two dimensions it is necessary to rely mostly on 
observation and self-evaluation. Pupils, particularly the younger ones, suggest 
their feelings by the degree of their attention and by their facial expressions 
as they read. Their talk and action after they have read a work are partially
indicative of any values they have incorporated from the work. But, for the
most part, only the pupil can know what he feels and what values he has 
derived from reading. He can be taught that such transfer of feeling and value 
is a proper outcome of reading and that he can gauge his own appreciation 
of a work by noting how he responds to it. 

Aesthetic Discrimination. Measurement of aesthetic discrimination may
be approached by asking pupils to select among passages that vary as to
aesthetic quality or by determining whether or not they know what are the
aesthetic factors in any work, or in general. The first approach is illustrated
by the following item.

Which lines of poetry do you like best?

a. As passed the noon and came the eve
   Then dressed all things as they did grieve.

b. Now came still evening on and had
   All things in her somber livery clad.

c. As evening approached all things turned
   Dark — the servants of the night.

The second means of measuring aesthetic discrimination is typified by
this type of item.



Check the factors essential for good poetry.

__ a. Exact meter                    __ e. Repetition and variety
__ b. Pleasing rhythm                __ f. Noble ideas
__ c. Alliteration                   __ g. Figures of speech
__ d. Aptness of sounds to
      ideas and moods



The first approach seems to have the greater inherent validity since the
task required is the sort of thing a person does when he exercises aesthetic
discrimination. It requires, however, that a prior judgment be made as to the
better of several comparable existing lines or passages or that plausible decoys
be invented to go with a selected good passage. The advantage of the second
method is its adaptability to all aesthetic aspects and the ease with which
these can be sampled. On the debit side, it requires an assumption that verbal
knowledge of an aesthetic element or principle is evidence of the application
of such knowledge to reading and such assumption may not be warranted.

Evaluative Standards for Literature 

When a teacher wishes to evaluate pupils’ accomplishment in literature,
he has little in the way of external standards to help him. Some few of the
published tests of literature have grade norms but these have no meaning
beyond the tests themselves. Moreover, the norms from different tests are
not mutually equivalent. The nature of literary knowledge and appreciation
precludes the existence of any nonrelative standard of correctness or propriety
such as dictionaries and style books can provide for composition.

The standards used by a majority of English teachers are for the most
part arbitrary and subjective. Letter marks or other symbols of value often
are assigned to guided response test scores on the basis of 70 per cent = D,
78 per cent = C, 85 per cent = B, 90 per cent = A. Here custom and the
teacher’s arbitrary decision are the source of the percentages-marks
relationship. In other cases, a free response test is read and assigned an A, B,
C, D, or F value with no intervening scoring nor any comparison with a known
standard. What has occurred has been that the teacher simply and immediately
expressed the degree to which he liked or disliked the pupil’s answers. Of
course, the teacher’s liking or disliking is based on his notions of what pupils
should know and feel about literature. And in view of the teacher’s training
and experience, it is a legitimate basis for evaluation. However, the teacher’s
feelings reflect other things than the pupil’s test response and there is no
possible way of verifying this type of evaluation or of checking it for error.
The problems involved in arbitrary and subjective standards are explained
more fully in Chapter 9.
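
The fixed-percentage custom described above can be written out as a small
lookup. What follows is only a sketch: the four cutoffs are the ones quoted in
the text, while the function name `letter_mark`, the sample scores, and the F
mark for scores under 70 per cent are assumptions made for illustration.

```python
# Cutoffs quoted in the text, highest first.
# The mark "F" for anything under 70 per cent is an assumption.
CUTOFFS = [(90, "A"), (85, "B"), (78, "C"), (70, "D")]


def letter_mark(per_cent, cutoffs=CUTOFFS, floor_mark="F"):
    """Return the letter mark for a percentage score under a fixed cutoff table."""
    for minimum, mark in cutoffs:
        if per_cent >= minimum:
            return mark
    return floor_mark


# Hypothetical scores on a guided response test.
scores = {"pupil_1": 91, "pupil_2": 84, "pupil_3": 78, "pupil_4": 69}
marks = {name: letter_mark(score) for name, score in scores.items()}
```

Moving any one cutoff by a point or two changes the mark of every pupil who
scored near it, which is one way of seeing that the percentages-marks
relationship rests on custom and not on measurement.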

Arbitrary or subjective marking may be minimized by the use of a
performance scale as a standard. A teacher, better a group of teachers, can
describe the various levels of achievement in literature that pupils are likely
to manifest in a grade or in a series of grades. They can have three, five, or
however many gradations best suit the differences among pupils. They can
give each gradation a letter or number equivalent. As they measure the
performance of pupils they can compare their status with the gradations of
performance on the standard, find the point of best fit, and assign a value to
the pupil’s achievement accordingly. This process does not remove
subjectivity from evaluation but it does minimize it. Moreover, it is a systematic
and rational process, not an arbitrary or emotional one. Such a type of
standard and manner of marking are explained in detail in Chapter 9, pages
203-205.

Literature seldom is marked separately on report cards but usually is
combined with composition and other things in a single mark for English. It
should, of course, be marked separately; and knowledge and appreciation, as
the two basic dimensions, should be marked separately. For any detailed and 
diagnostic evaluation of literary achievement, each separable subdimension of 
knowledge and appreciation must, of course, be evaluated individually. 


SECONDARY SCHOOL ENGLISH 


English instruction in junior and senior high schools and in junior colleges
involves one or more of the aspects of language arts and, hence, little special
attention needs to be given to measurement and evaluation in English per se.
There are two things that must be stressed. First, English is a composite
subject and its various parts or dimensions should be measured separately.
Second, a number of marks should be issued in English, not just one A, B, or D.

It will do little good, though, to perform separate measurements and
evaluations if the pupil and his parents are to receive a report card with a
single mark in English. So far as pupils are concerned, evaluations are
significant only when they are communicated to them. If a single letter, A, C,
or F, is the only symbol used for evaluating such a complex subject as English,
the pupil has received an inadequate and possibly misleading communication.
Consequently, teachers are urged to use a number of marks, one at least for
each major dimension of each language arts element in the course.

In Figure 49 is shown a very simple composite report card for use in
English. The dimensions of each element are not broken down as extensively
as they might be, and as they should be during measurement. However, a
further breakdown might confuse the pupil and not justify the additional space
and effort involved.

                  As compared          As compared with     As compared      Special
                  with ability         others in the        with a           comments
                                       same grade           standard
                  Poor  Acceptable     Below  Average       [entries
                  Good                 Above                illegible]
Composition
  Mechanics
  Effectiveness
Reading
  Rate
  Comprehension
Speaking
  Mechanics
  Effectiveness
Literature
  Knowledge
  Appreciation

Figure 49. A profile reporting form for English or Language arts.

The first rating column equates a pupil’s performance with his ability and
entries necessarily will be imprecise and subjective. Effort and gain from the
beginning of a marking period are judged here. The third column entails
use of a standard that has several gradations of competence deemed typical
of successive grades. Entries might be made in terms of age rather than grade
equivalents. The idea is to evaluate a pupil by telling him that his work is like
that of the same, of inferior, or of superior grade pupils. The second column
and its use are thought to be self-explanatory. Since so many measurements
will result in rank symbols, these evaluations may require no more than a
summarizing or averaging of measurements.* Special comments should be
entered as needed to clarify or supplement the column entries.

If usage in a school requires that a single grade be recorded and issued 
for English, an additional composite evaluation still can be given to the pupils. 
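
A profile report of the kind described above amounts to one small record per
dimension. The sketch below is illustrative only: the class `ProfileEntry`, the
function `format_profile`, the field names, the sample pupil entries, and the
choice to record the third column as a grade level are all assumptions, not
features of any published form.

```python
from dataclasses import dataclass

# Rating words for the first two columns of the profile form.
ABILITY_RATINGS = ("Poor", "Acceptable", "Good")
GRADE_RATINGS = ("Below", "Average", "Above")


@dataclass(frozen=True)
class ProfileEntry:
    """One row of a profile report: one dimension of one language arts element."""
    element: str            # e.g. "Literature"
    dimension: str          # e.g. "Appreciation"
    vs_ability: str         # first column: performance compared with ability
    vs_grade: str           # second column: compared with others in the grade
    grade_equivalent: int   # third column (assumed here to be a grade level)
    comment: str = ""       # special comments, entered only as needed

    def __post_init__(self):
        if self.vs_ability not in ABILITY_RATINGS:
            raise ValueError(f"unknown ability rating: {self.vs_ability}")
        if self.vs_grade not in GRADE_RATINGS:
            raise ValueError(f"unknown grade rating: {self.vs_grade}")


def format_profile(entries):
    """Return the report as text lines, one line per dimension."""
    lines = []
    for e in entries:
        line = (f"{e.element} / {e.dimension}: {e.vs_ability}; "
                f"{e.vs_grade}; grade {e.grade_equivalent}")
        if e.comment:
            line += f" ({e.comment})"
        lines.append(line)
    return lines


# Hypothetical entries for one tenth-grade pupil.
report = [
    ProfileEntry("Literature", "Knowledge", "Good", "Above", 11),
    ProfileEntry("Literature", "Appreciation", "Acceptable", "Average", 10,
                 "prefers adventure stories"),
]
```

Keeping each dimension in its own record means a single composite mark, where
a school requires one, can be issued in addition to the profile and need not
replace it.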

Summary 

The principal phenomena of language arts so far as measurement is con- 
cerned are reading (including reading readiness), speech, composition (in- 
cluding grammar, spelling, and vocabulary), handwriting, and literature. 

Reading readiness is measured by observation, ocular examination, and 
readiness tests directed toward such factors as mental maturity, oral language 
ability, and sensory acuity. Reading has various dimensions, rate, comprehen- 
sion, and vocabulary being the major ones, and is measured frequently by 
standardized tests. Reading grade or reading age is the usual index of a pupil’s 
skill. Many tests yield separate scores for various reading dimensions as well 
as a total score. 

Speech is measured largely by observation, testing being inappropriate 
for the most part. Of concern to teachers are dimensions of rate, rhythm, 
pitch, amplitude, tone, enunciation, and diction. Standards for evaluating 
speech tend to be subjective and based on textbooks and the teacher’s experi- 
ences. Use of recordings of pupil speech of various gradations of quality is 
advocated. 

As taught in the schools, composition is a complex of actual writing and
“separate” subjects of grammar, spelling, and vocabulary. Actual writing is
best measured through analysis of pupil writings. Grammar, spelling, and
vocabulary may, of course, be measured through this means too, but there
are available for these dimensions a number of test procedures and many
published tests. Standards in use are dictionaries and style books and teachers’
ideas of what a given grade should accomplish.

Handwriting is judged through observation of pupil specimens and by 
comparison of certain specimens with a series of charts showing gradations 
of skill and beauty. Because of this direct comparative way of measurement, 
little attention is given to discrete dimensions of handwriting but among them 
are letter form, slant, and spacing. 

Literature is the most difficult aspect of language arts to measure. The 
basic dimensions of a pupil’s achievement in literature are knowledge and
appreciation. Both of these are covert dimensions and must be measured
indirectly. Locally devised tests are the mainstay for measuring the knowledge 
items though many standardized tests are on the market. The attitudes and 
discriminations involved in literary appreciation may be appraised by attitude 
scales and by tests of aesthetic knowledge and taste. Evaluative standards are 
almost entirely a function of the individual teacher. 


* See Chapter 9 for further discussion of report cards and evaluative standards.





Secondary school English comprises most of the aspects of language arts. 
In evaluating achievement in English, it is important that the separate phases 
be separately evaluated.


EXERCISES 

1. Prepare a card file of published tests for all areas of language arts that
concern you. For each test give identifying data (publisher, cost, grade, time,
etc.) and a statement about its validity and reliability. Base the latter on an
inspection of the test and reviews of it in Buros’ Mental Measurements Yearbook.

2. Prepare an analysis or rating form to use in evaluating pupil compositions.
Indicate clearly the dimensions to be measured, how they are to be measured, and
the standard to be used in evaluating them.

3. Prepare an observation form to use in appraising a student’s public
speaking ability. Include essential dimensions and appropriate ways of rating
them. Use the form to rate the speech of one or more students.

4. If you specialize in a subject other than English, list the special dimensions
of reading and vocabulary important in the subject.

5. For the following abstract dimensions of writing and speaking, indicate
tangible things or overt behaviors related to them.

Interest          Humor
Originality       Organization
Style

6. Prepare a guided response test for measuring usage, vocabulary, and
spelling.

BIBLIOGRAPHY

1. Buros, Oscar K., Fourth Mental Measurements Yearbook. Highland Park,
N. J.: Gryphon Press.

2. Caldwell, C. G., “Process of Communication in Children’s Letters,”
Elementary School Journal, 49:79-88, October, 1948.

3. Coward, A. F., “Comparison of Two Methods of Grading English
Compositions,” Journal of Educational Research, 46:81-91, October, 1952.

4. Dolch, E. W., “Testing Reading With a Book,” Elementary English,
28:124-125, March, 1951.

5. Dole, A. A., and Fletcher, F. M., Jr., “Some Principles in the Construction
of Incomplete Sentences,” Educational and Psychological Measurement,
15:101-110, 1955.

6. Ebbitt, W. R., and Diederich, P. B., “Validity of an Examination in Writing,”
College English, 11:285-286, February, 1950.

7. Fels, W. C., “College Board English Composition Test, Present and Future,”
Education, 11:4-10, September, 1950.

8. Free, R. J., “We Measure Growth Together,” Educational Leadership,
4:464-468, April, 1947.



CHAPTER 11


SOCIAL STUDIES 


“Social Studies” instruction throughout the United States consists of a
wide variety of courses, subjects, curriculums, and skills. In some places it
is a collective noun meaning History, Geography, and Civics. In others it
signifies a twelve-year unified study of society and its problems, without
regard to traditional areas of knowledge. And, in a great many instances, it
represents a unified study of society in the elementary grades and a collection
of “subjects” in the secondary grades. Where Social Studies is considered a
subject in itself, there tend to be included such “noncontent” items as study
skills, critical thinking, and social sensitivity, and even personality adjustment
and vocational exploration.

Despite this diversity of elements, there is a National Council for the
Social Studies. More and more, high school courses are being designated as
Social Studies 1, Social Studies 2, etc., even if they are simply U.S. History,
World History, etc. College majors are being offered in Social Studies or the
more academic Social Science. There seems to be an increasing tendency
among schools and schoolmen to treat Social Studies as a subject per se rather
than as a means of classification. And, finally, there are many common
elements among history, geography, and civics when it comes to measurement.

Consequently, measuring and evaluating in the Social Studies is presented
here as a unit, but attention is given to achievement in subjects, to progress
in units on society, as well as to problem solving and allied skills.¹ Personality
and group guidance aspects of Social Studies instruction are excluded since
they are more conveniently handled in other chapters, 15 and 16. The
structure of the first part of the chapter is like that of the previous one. First,
attention is given to the phenomena and dimensions of Social Studies, then to
forms and procedures of measurement, to standards and marking practices,
and finally to certain special factors in Social Studies evaluation: the marking
of citizenship, the Social Studies unit, and diagnosing disabilities. In the
second part of the chapter a detailed description is presented of how
measurement and evaluation might actually be conducted in one class.


¹ Buros’ Mental Measurements Yearbooks (4) use Social Studies as a division
heading for standard tests although the bulk of the tests are specific to history,
economics, etc.




GENERAL CONSIDERATIONS 

Phenomena and Dimensions to be Measured 

In our consideration of measurement and evaluation in the language arts
we were able to draw heavily on research and on measurement practice for a
determination of appropriate measurable phenomena and their dimensions.
Here, however, we must rely somewhat more on logical analysis. Published
research is rare that deals critically with what to measure in the Social Studies
and current standardized tests of History, Geography, etc., permit few clear
inferences as to the dimensions they measure.

MAP AND CHART WORK 

In nearly all the social studies pupils read and draw maps, interpret and
produce charts. Hence, one obvious thing that needs measurement is skill in
map reading, chart interpretation, and similar activities. The more important
measurable dimensions of these skills include four of the general ones
described in Chapter 2. In maps and charts occur many conventional symbols
and arrangements, proper recognition of which is essential for meaning. For
example, maps are full of conventional symbols for railroads, mountain crests,
airports, passes, capital cities, and degrees. Charts are read from
left to right and top to bottom, the headings of columns and rows apply to
the entire stretch of the columns and rows, decimal places are a function of
position, and the “legend” explains the meaning of dotted lines, crosshatching,
various colors, etc. These conventional symbols and arrangements are the
elements of the map and the chart, and two dimensions of the pupil’s skill
with maps and charts are then:

1. The identity of the symbols he knows.

2. The number of such known symbols.

In addition, the time it takes pupils to read and interpret or to produce
a given map or chart is important and therefore rate is a third dimension.
Moreover, pupils must read the symbols aright, must follow columns and
rows accurately, must judge scale precisely, so error is a fourth dimension
of the skills. Then we know that certain pupils are map-minded and others
are not, certain ones refer to charts in many different contexts while others
use a chart only when specifically ordered to do so. Consequently, application
or the tendency to use or not to use charts and maps in one or several other
activities constitutes an additional important dimension.
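
The first four dimensions named above (identity, number, rate, and error) can
be tallied from a single map-symbol test, as the following sketch suggests. The
function `map_skill_profile`, the symbol descriptions, and the sample responses
are hypothetical; application is left out because it is a matter of observing
the pupil over time and cannot be read off one test paper.

```python
def map_skill_profile(answer_key, responses, minutes_taken):
    """Summarize one pupil's map-symbol test along four dimensions.

    answer_key:    symbol -> correct meaning
    responses:     symbol -> meaning the pupil gave (omitted symbols are absent)
    minutes_taken: time the pupil needed for the whole test
    """
    # Symbols the pupil identified correctly.
    known = sorted(sym for sym, meaning in responses.items()
                   if answer_key.get(sym) == meaning)
    # Symbols the pupil attempted but misread.
    misread = sorted(sym for sym, meaning in responses.items()
                     if sym in answer_key and answer_key[sym] != meaning)
    return {
        "identity": known,            # which symbols he knows
        "number": len(known),         # how many he knows
        "rate_minutes": minutes_taken,
        "errors": len(misread),
    }


# Hypothetical test of three symbols, described in words.
answer_key = {
    "cross-ticked line": "railroad",
    "star in circle": "capital city",
    "airplane outline": "airport",
}
responses = {
    "cross-ticked line": "railroad",
    "star in circle": "airport",
    "airplane outline": "airport",
}
profile = map_skill_profile(answer_key, responses, minutes_taken=4.5)
```

Reporting the four values separately, without folding them into one score,
keeps it visible whether a pupil is slow but accurate or quick but careless.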

KNOWLEDGE OF SUBJECTS

A second phenomenon of Social Studies, even more universal than map
and chart work, is also more ambiguous and less tangible. Teachers insist that
pupils must learn the facts, develop the necessary concepts, and get the
relationships right if they are to understand or to have a good knowledge of
economics, or history, or local government.² Such knowledge-understanding
apparently is the name for a construct, an explanation for certain differences
we observe in the behavior of pupils. If John writes a good essay on
California climate and Bill a poor one, it is customary to say that John “knows”
or “understands” the subject while Bill does not. The “knowledge” is not,
according to customary usage, the behavior of writing, it is something that “lies
back of” the writing, which, quite literally, understands the behavior.

Whatever else may be involved, remembrance of words heard, read, and
spoken, of figures and pictures seen, is critically involved. So for purposes of
measurement it may be said that knowledge or understanding means the
totality of a student’s remembrance of whatever he has seen, heard, imagined,
guessed, or otherwise sensed that relates to the subject in question. A pupil’s
knowledge of the geography of North America would mean, for example,
whatever the pupil can recall of the text and reference material, of the
teacher’s explanations, of his answers to study questions, of motion pictures
and slides, of the statements made by other pupils during discussions plus the
additions, deletions, and distortions his own imagining has devised about the
subject.

Given this definition, the measurable dimensions of the phenomena are,
first, the identity and number of facts or single ideas remembered and their
organization, and the concepts or groups of ideas remembered. A fourth
important dimension consists of the feeling correlates of given items of
knowledge. Along with remembrances of images come feelings of like and dislike,
interest and aversion, confidence and anxiety. Since the significance of a
pupil’s knowledge-understanding is affected by such feelings, appraisal of
them is essential to adequate evaluation.

To illustrate these four dimensions, we use a pupil’s understanding of
North American geography. Among the components of his knowledge are
the recollection that the mean rainfall in the Southwestern deserts is less than
10 inches a year and that Charleston is the capital of West Virginia. Both
of these are “facts” or single ideas. Among his concepts are the generalization
recalled from a chapter summary, “the seaward sides of mountain ranges tend
to be verdant and the landward side arid,” and a rule of capital location that
he has thought of himself, “state capitals are usually in the middle of states.”
The pupil’s “mind” generally contains a very large number of facts and a
somewhat smaller number of concepts relative to these. The organization
among his ideas is one of simple verbal classification; he thinks of cities
together, then of rivers, then of mountain heights, etc. Finally, to exemplify
feeling correlates, the pupil “likes” to recite place names and populations but
feels “uneasy” about what we would call ecological ideas, that people in the
great plains have a culture different from that of the residents of Atlantic
coast states.

² Logicians and even psychologists often distinguish between knowledge and
understanding on the grounds that the former has more to do with ideas in
isolation while the latter is more concerned with the pattern among the ideas.
However, for purposes of measurement the two are hardly separable and we use
knowledge to mean both.

ATTITUDES 

A third phenomenon of common concern in the Social Studies is the
pupil’s attitudes toward things of social significance. Measurement of attitudes
is examined in detail in Chapter 15. Social Studies teachers primarily are
concerned with attitudes toward governmental forms, nationalities, races,
peers, school, and other matters important for democratic citizenship.

PROBLEM SOLVING

Finally, problem solving is a phenomenon that most Social Studies
instructors may wish to measure. In problem solving are included such
ill-defined entities as critical thinking, reasoning ability, and scientific
method. By problem solving is signified a pupil’s behavior in a situation that
requires some novelty of action for success. This may be a response to a
direction to prepare a notebook, the action a discussion leader undertakes when
several of the discussants become aggressive, or simply a pupil's essay on an
“original” topic. The dimensions of problem solving are, first, the dimensions
of the knowledge, the attitudes, and the skills involved. In addition, to
appraise a pupil's status in problem solving with any precision it is necessary
to determine his goals, the ways he tries to accomplish his goals, how long he
persists in a given effort and, finally, his errors in goals or means.

We have presented four phenomena commonly subject to measurement
and evaluation in the Social Studies. Obviously, other phenomena will
concern given teachers in given situations. Moreover, the dimensions indicated
for the phenomena are not exhaustive. Others may need to be added to
accomplish special purposes of measurement in certain subjects at certain grade
levels. As other dimensions are included, very careful attention should be
given to the conditions of measurability described in Chapter 2, pages 19-28.
You may recall that phenomena are measurable in the degree to which
they have dimensions which:

1. Are common to several like phenomena,

2. Provide sensory data,

3. Are clearly defined,

4. Manifest variation,

5. Produce highly similar reactions among many unrelated and impartial
observers.

Forms and Procedures of Measurement 

In considering now the forms and procedures available for measuring the
dimensions just described we should notice that the phenomena and their
dimensions are primarily verbal. Thus those forms and procedures which
measure verbal behavior are likely to be most appropriate and useful in the
Social Studies. Moreover, a procedure must be “insulated” against measuring
verbal skill if a social studies phenomenon is to be measured and not language.
Nowhere else is it seemingly so necessary to keep tests of knowledge from
being tests of reading ability.

CLASSIFICATION AND RANKING ARE PRINCIPAL FORMS

With our verbal phenomena and dimensions we must rely a great deal on
description and classification. Indexes of status in problem solving or map
skill often must remain as verbal summaries or at best as gross verbal
classifications: average, good, poor, for example. Where it is possible or
appropriate to compare pupils with one another we can express their status in
terms of rank: “Nancy is in the third quarter; Tom stands last in his class;
Bill is at the fifth percentile.” But save in rare circumstances, we cannot use
scaling as a form of measurement for Social Studies. Because our phenomena and
dimensions are verbal and because our comparisons must ultimately be with
other persons, it is extremely difficult to establish a zero or to devise equal
increments of difference. Equal increments and zeros or other fixed reference
points are, as we have learned, necessities for scaling.
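Statements of rank such as those quoted above can be worked out directly from a set of class scores. The following is a minimal sketch only: the scores are hypothetical, the function names `class_rank` and `percentile_rank` are ours, and percentile rank is taken here as the per cent of the class scoring below the pupil, which is one of several definitions in use.

```python
def class_rank(score, class_scores):
    """Rank in class, 1 = highest. Pupils with tied scores share the best rank."""
    return 1 + sum(1 for s in class_scores if s > score)


def percentile_rank(score, class_scores):
    """Per cent of the class scoring strictly below `score` (an assumed definition)."""
    below = sum(1 for s in class_scores if s < score)
    return 100.0 * below / len(class_scores)


# Hypothetical raw scores for a class of ten pupils.
scores = [12, 15, 15, 18, 20, 22, 25, 27, 30, 33]

for pupil_score in (33, 25, 12):
    print(pupil_score, class_rank(pupil_score, scores), percentile_rank(pupil_score, scores))
```

Note that both figures describe only a pupil's position among his classmates; neither supplies a zero point or equal increments.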

The rare circumstance occurs when a published test of achievement is
used that has been designed to yield standard scores of some sort. If the
test has been very carefully validated the resulting scores will relate to a
fixed mean and score differences at any segment of the scale may be roughly
equivalent to the same degree of differences at any other segment.
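To make the idea of a standard score concrete, here is a small sketch that converts raw scores to a scale with a fixed mean. It assumes the common convention of a mean of 50 and a standard deviation of 10, which is our choice for illustration and not a convention named in the text; the function name `standard_scores` and the raw scores are likewise hypothetical. A published test would derive its mean and deviation from a large norm group, not from one class as is done here.

```python
from statistics import mean, pstdev


def standard_scores(raw_scores, new_mean=50.0, new_sd=10.0):
    """Linearly rescale raw scores to a fixed mean and standard deviation."""
    m = mean(raw_scores)
    sd = pstdev(raw_scores)  # population standard deviation of this group
    if sd == 0:
        raise ValueError("all scores are identical; no scale can be formed")
    return [new_mean + new_sd * (x - m) / sd for x in raw_scores]


# Hypothetical raw scores.
raw = [10, 20, 30]
print(standard_scores(raw))
```

The rescaling itself only fixes the reference point. Whether equal score differences mean roughly equal differences in achievement still depends on how carefully the test was built and validated.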

MEASURING MAP AND CHART SKILLS

The map and chart skills of Social Studies are amenable to measurement
through observation and product analysis. You can watch a pupil draw a map
or fill in a time line and thus appraise his concentration span and his speed.
You can appraise his map or inspect his chart to determine what map symbols
he uses accurately and how much he knows of chart protocol. There are
published guided response tests for map skills and several standardized reading
tests have sections on map reading. Pupils' compositions based on maps
and charts can be analyzed for the map and chart reading skill there
demonstrated.

TESTS OF KNOWLEDGE

On the other hand, observation and product analysis probably are
inadequate for measurement of Social Studies knowledge. To get at the
dimensions we described for knowledge and properly to sample its extent, we need
the advantages of standard stimulations and, at times, of standard responses
in addition to standard analysis systems. Consequently, we must use a great
many free and guided response approaches: essay questions, things to list,
charts to fill in, multiple option and true-false items, matching questions, etc.

3 See Appendix B, page 489.

APPRAISING PROBLEM SOLVING ABILITY 

Problem solving is most frequently measured through observation. Where 
situations in which the pupil is observed are problem situations to the pupil, 
this procedure can be effective. Certainly, observation is inherently the most 
valid procedure for measuring complex behavior. However it is time-consum- 
ing and often unreliable. 

Problem solving facility can be measured to some extent by properly
designed paper-and-pencil instruments. Illustrative of these is the test of
“Ability to Apply Social Facts, Generalizations, and Values” contrived and used
by the Progressive Education Association in its Eight-Year Study of Secondary
School Programs (15). In this test the student is confronted with a problem
situation, is required to choose among three courses of action, and then to
support his choice with reasons. The student’s reasons are analyzed according
to a schedule and, needless to say, carry more weight in scoring than his
choice of action. A second example of a “test” that deals with problem
solving was designed by Edwards to measure development in critical thinking
(7). Facts are presented about the “common cold” and the student is
required to compare the validity of a number of statements purporting to be
based on these facts.

STANDARDIZED TESTS

When you wish to use verbal tests to measure Social Studies phenomena, 
be they study habits, knowledge, or problem solving, you frequently may 
choose between constructing your own test and using a published one. The 
Fourth Mental Measurements Yearbook (4) lists -eight published instruments
for use in the Social Studies, distributed as follows:

    Social Studies (general)    7     History
    Civics and History          1     Political Science    10
    Economics                   4     Sociology            -
    Geography                   s

In addition, most achievement batteries have Social Studies components and
many published interest and attitude tests are applicable to certain Social
Studies tasks. The titles of certain tests, together with identifying and critical
notes, are presented in Appendix B, pages 462-492.

In using them, it is well to be reminded that published tests in Social
Studies probably were not made just for your purpose nor your class. They
have to be generally applicable to be published and therefore may not fit your
situation perfectly. Furthermore, the only absolute advantages in a published
test are that it may have large population norms and can yield norm scores
while yours will not. But, if you construct your test, use, interpret, and
revise it properly, yours can be just as reliable, just as valid, and possibly a
great deal more pertinent than the published ones. It usually is necessary to
use published tests in measuring many schools or classes and in comparing
any given group of pupils with large populations of pupils.

Evaluative Standards 

Generally accepted and uniform external standards do not exist in the 
Social Studies. This is in contrast to the situation in Language Arts with its 
dictionaries for spelling and pronunciation, its handbooks for usage, and its 
charts for penmanship. In lieu of generally accepted and uniform standards,
typically three sorts of things are used. 

1. Statements of instructional objectives 

2. Textbooks 

3. The teacher's ideas or ideals of how things should be

OBJECTIVES AS STANDARDS

In the case of the first, the objective may be anything from a very general
single statement, “Learn about the past and its relation to the present,” to
a detailed list of statements, “Recognize several primary and secondary
sources,” “Discriminate among biased and objective sources,” etc. The
relative adequacy of such course or curriculum objectives as evaluative standards
depends on how well they meet the conditions previously established for them
(pages 193-194).

To examine the usefulness of curriculum objectives as standards, let us
look at the Social Studies objectives of a ninth-grade Basic Studies course
in a medium-sized high school (9), and test these objectives against the
characteristics of a valid standard. The objectives are:

1. To acquaint the student with the society in which he lives.

2. To inculcate an appreciation for the social customs of the times and
man's cultural heritage.

3. To teach the need for civic responsibility, and to kindle enthusiasm for
active participation in our government.

4. To develop the ability to listen actively and to judge critically when
ideas are presented.

5. To encourage group activities and co-operative planning.

6. To understand man in relation to his physical environment and to see
the need for conservation of natural resources.

7. To “sell” the advantages of the “American Way of Life.”


Among the conditions of adequacy established for evaluative standards
are reasonable objectivity, clear definition, constancy, expression in terms
comparable to those used in measurement, practicality, and, foremost, they
must constitute a clear-cut variation scheme of quality appropriate to
differences among pupils.

How well do the objectives meet these criteria for standards? The terms
in 2, “appreciation” and “cultural heritage”; in 3, “responsibility”; and in 7,
“American Way of Life,” have widespread feeling connotations and are
involved with the individual’s personalized motives and beliefs. All the
objectives contain vague terms. Consequently, the objectives are probably too
subjective and ill-defined for efficient use as evaluative standards.

It is true that they probably are reasonably constant and are practical.
Revision of the course and its objectives should not be anticipated for five
years or more. Moreover, the objectives do not seem to represent the caprice
of a given teacher or a deviate philosophy. No protocol, time, or expense is
involved in their use.

However, the objectives are not expressed in terms comparable to those
that will express the measurement of status. Neither do they satisfy the
most important condition of all: that they must constitute an appropriate
variation scheme of quality. The measures probably will be “75 per cent”
on a 30-item test, B on a notebook, etc., and such comments about behavior
and study habits as “noisy but good-natured” and “doesn’t concentrate easily.”

The objectives are statements of a given accomplishment level or goal.
Any spread or range of individual difference must be implied. For example,
“. . . to judge critically when ideas are presented” permits us to say “fine,”
“OK,” “you made it,” or some such to students who do whatever we mean by
“judge critically.” But what should we say about the student who falls short
of this level, and of the third student, who we feel is even worse than the
second? Are we to say that both are simply “unsatisfactory”?

Now, it is considered that these particular objectives are not atypical of
Social Studies course objectives generally. Consequently, it is probable that
course objectives as such are not likely to constitute good evaluative
standards. They can be the source of standards or a point on a standard scale and,
of course, they may be reconstrued to become standards.

TEXTBOOKS AS STANDARDS

Using a textbook as a standard against which pupils are judged is likely
to be even more invalid than using course objectives. While they are objective,
clearly defined, constant, and practical, they have nothing to do with the
differences among pupils or with the quality of pupil achievement. Textbooks
are only a source of information or a means of instruction. Their role in
evaluation is to constitute one of several compilations of facts and concepts
against which the pupil’s knowledge may be checked for extensiveness and
accuracy. Thus they have more to do with test construction and scoring than
with marking.



TEACHER OPINION AS STANDARDS 

It is useless to try to validate teachers’ ideas against our several
conditions for validity. They come in all sizes and shapes and, like small boys,
won't stand still. Needless to say, they fail to meet the first condition of
objectivity completely and tend to be inconstant. But they can measure up to all
the others. Whether they do or not is a function of the given teacher.
Objectivity and constancy can be approached by writing down whatever concepts
of “geographic” or “contemporary affairs” knowledge the teacher intends to
use as evaluative standards. To write them down is often sufficient impetus
for thinking them through. When this is not done, feelings about pupils
(perhaps based on their misbehavior or the teacher's dyspepsia) may be the de
facto standard.

STRICTLY RELATIVE MARKING

The final current possibility is to use no standard at all. This occurs
when test scores are converted directly into grades on some arbitrary or
statistical basis. In illustration, the a priori decision that 70 per cent is passing
eliminates any reference to a standard. The allocation of A, B, C, D, and F
to raw scores on the basis of rank in a class or the characteristics of a
frequency distribution is another arbitrary way of judging quality. Here status
is merely symbolized in another way. In Chapter 9, pages 194-195, there is
further discussion of marking on the basis of a distribution and on arbitrary
percentages of test questions answered correctly.
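The arbitrariness of both practices shows up plainly when they are written out as procedures. In the sketch below, the 70 per cent cutoff comes from the illustration above; the shares of A, B, C, D, and F (10, 20, 40, 20, and 10 per cent of the class) and the function names are our own assumptions, chosen only for demonstration.

```python
def passes(percent_correct, cutoff=70):
    """A priori rule: a fixed per cent of items correct is 'passing'."""
    return percent_correct >= cutoff


def marks_by_rank(scores, shares=(("A", 10), ("B", 20), ("C", 40), ("D", 20), ("F", 10))):
    """Assign letters by rank in class, using assumed percentage shares.

    Ties are not treated specially, so tied pupils may receive different marks.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    marks = [shares[-1][0]] * n  # anyone left over receives the lowest letter
    start, cumulative = 0, 0
    for letter, share in shares:
        cumulative += share
        end = round(cumulative * n / 100)
        for i in order[start:end]:
            marks[i] = letter
        start = end
    return marks


# The same raw score of 75 in two hypothetical classes of ten pupils.
ordinary_class = [95, 88, 82, 79, 75, 71, 68, 64, 59, 40]
strong_class = [75, 99, 98, 97, 96, 95, 94, 93, 92, 91]
print(marks_by_rank(ordinary_class)[4], marks_by_rank(strong_class)[0], passes(75))
```

A score of 75 passes under the fixed cutoff, earns a C in the first class, and earns an F in the second, although the pupil's performance is identical in all three cases. Status has only been relabeled.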

Apparently then, to evaluate properly in Social Studies requires the design
of standards and not their selection only. The use of a performance scale
containing several defined levels of achievement seems to have the most
promise for evaluating knowledge-understanding (see Chapter 9, pages 203-204,
for a generalized form of one). In a section to follow (page 279), a
performance scale is described for knowledge of U.S. History.

Marking and Reporting 

With or without adequate evaluative standards, teachers must mark pupils'
achievement in Social Studies subjects and they do so in much the same
manner as they do other subjects. Except in the primary grades, the A, B, C, D, F
system prevails for subject achievement. In the primary grades, some sort of
“satisfactory,” “unsatisfactory” rating together with provision for comments
is becoming a fairly common practice. Reporting procedures for Social Studies
similarly are little different from those used in other subjects. A single letter
mark typically is placed by the subject (fifth-grade Social Studies,
eleventh-grade U.S. History, and so forth) and signifies the pupil’s over-all
achievement in the course for the period reported. In many primary grades,
parent-teacher conferences and/or anecdotal statements have replaced the
traditional report card. Teachers comment that such reporting practice is more
satisfactory for Social Studies in the primary grades than is letter marking but
that it is more time-consuming and requires more skill.

Citizenship and Psychological Grading 

All elementary and high schools are concerned that pupils behave them- 
selves properly and that they develop desirable traits of character. Also, the 
great majority of teachers are basically kind and sympathetic toward their
pupils. Salutary as both these conditions are, they greatly complicate the 
problem of measurement and evaluation. They complicate it in all classes and 
subjects, but the full brunt of “marking on citizenship” is felt in Social Studies 
classes. 

Practices relative to evaluating citizenship vary from no admitted
attention to the matter, through lowering of subject marks because of adverse
citizenship, to a specified grade or grades in citizenship. Where citizenship
receives no admitted attention, it is doubtful that it receives no attention. The
safest assumption in such case is that the teacher’s assignment of subject
grades is influenced in unknown ways and to an unknown degree by the social
behavior of the pupils. Consequently, we have grades with unknown
significance: how much is subject and how much is citizenship? Where marks may be
lowered for breaches of the peace, a similar situation obtains unless how much
the grade is lowered is clearly communicated to the pupil. It is thought that
use of a separate grade or grades for citizenship is preferable to either of the
other practices and we offer three suggestions as to how it may be done more
effectively.*

1. Citizenship is a complex; identify the components and mark them
separately: co-operation, honesty, care of property, for example.

2. Observe the pupil’s status for these factors, record pertinent events, and
base marks on the record, not on recollection or intuition.

3. Let marks represent the pupil's status compared with a standard. This
standard needs to be appropriate to the age of the pupils to which it is applied.
It should not be the teacher’s idealized conception of citizenship for himself
or another well-educated adult.

As for “psychological marking,” the term is currently used for situations
where marks are assigned on the basis of their effect on the ego of the pupil
rather than on the basis of his achievement. If a B is what the pupil needs to
satisfy his desire for success, give it to him. If a D will give him the jacking
up he needs, assign the D. It matters not that the B student does inferior work
and the D does superior work. Marks, in these cases, evidently are intended
to be therapeutic rather than informative.

Since the ordinary and usual significance of letter marks is that of value
symbols, you are advised not to engage in psychological grading. To express
sympathy for pupils is fine, to motivate them is fine, to have mental health in
the classroom is fine, but you can do these things without confusing the
evaluation process. Praise the pupil for gain, upbraid him for laziness, and be
kind, accepting, permissive in instruction, but let his marks symbolize the
value of his achievement according to some known standard.

* The measurement of character and citizenship is examined in greater detail in
Chapter 15, pages 415-419.

The Social Studies Unit 

The “unit,” as usually found in the elementary grades, means a group
or series of pupil activities having a single over-all purpose. As a rule, the
pupils and the teacher plan their objectives and their activities together; groups
of pupils work separately at their own rates to accomplish things connected
with the unit objectives; periodically there is interchange of information
among the groups; the unit culminates on the completion of group tasks with
an all-class activity: discussion, pupil presentations, etc. An example might
be a unit with the objective, “Understanding the Indians in Colonial
America.” One group makes a mural, another an Algonquin village, a third
designs Indian costumes and weapons, etc. The unit culminates with a pageant
to which parents are invited.

It may be apparent from this sketch, and it is undoubtedly known by
teachers who use the unit, that evaluation here entails a number of special
problems and considerations.

First, a unit is a device by which pupils may practice talking, writing,
reading, construction, arithmetic, etc., in a “make sense” situation, all in
addition to accomplishing the stated objective of the unit. It is suggested that
measurement-evaluation of progress in these skills be performed in two ways:

1. Periodically during the semester without any reference to units through
appropriate free and guided response testing and product analysis;

2. Casually during the unit through observation.

In the latter case, the teacher is advised to watch extremities of status and
not to try to appraise all the pupils. During the unit, it is possible to detect
disabilities and talents as they may express themselves in the pupils’ lives,
because of the nonbookish or nondrillish quality of the instruction. And the
teacher in a unit situation generally is too busy to evaluate all the pupils on
their progress in all skills.

Second, the unit tends to have conceptual and performance objectives or,
in the latter case, rules. In the example cited, “knowledge of the food-getting
practices of the Eastern seaboard Indians” might have been a conceptual
objective, while “to share my ideas with others” and “to be neat in my work”
might have been two performance rules. Now these objectives and rules
frequently are considered the evaluative standards against which the pupils’
learning and actions are to be judged. To use them properly as standards
requires that they be restated according to the conditions requisite for
standards. In illustration, “neat in my work” needs to be operationally defined
(picks up waste paper, washes out paint brushes, etc.) before pupils may
validly be judged to exercise a given degree of neatness.

In the third place, much of the pupil work in a unit is diversified as to type
and difficulty. In consequence, pupils may not be compared directly with other
pupils in the class as a means of determining their status on unit objectives.
Moreover, ratings on objectives and rules given to one pupil are not
necessarily comparable to the ratings given to another pupil. In the absence of
direct comparability, the teacher may resort for classification or ranking
purposes to written or remembered stereotypes based on large numbers of pupils
of a given age who have engaged in similar unit activities.

Fourth, evaluation in the unit often takes the form of group evaluation
and self-evaluation rather than teacher evaluation. A justification for these
procedures, other than their supposed inherent merits, is that the unit is
socialized instruction and that the evaluative aspect of instruction should be
similarly socialized. In the view of the authors, the conversations with the
teacher about “how I’m doing” generally do help the pupil evaluate himself
validly. The other devices, however, the self-rating forms, the group
discussion, the committee members' ratings of one another, seem to possess no
inherent magic. As employed in some instances they are effective and, as
employed in other instances, they do not constitute any degree of
measurement or evaluation as we have defined the terms. And by no means should a
teacher consider that group or self-evaluation may replace teacher evaluation
in unit instruction or, for that matter, in any other mode of instruction.

Finally, we need to recall that the unit has little standardized “content.”
As a rule, all pupils do not read the same book, use the same references, or
study the same topics. Consequently, tests of an informational or factual sort
on the subject of the unit are generally inappropriate. To the extent that all
pupils do study the same thing, such tests are, of course, effective.

Diagnosis of Disabilities 

Diagnosis of disabilities is an infrequent activity in the Social Studies.
Publishers' catalogues and test bibliographies list no diagnostic instruments.
It is apparent, nonetheless, that progress in history or civics is not along an
even front for many students and, hence, diagnosis may be a desirable
teaching activity if not a usual one.

Educational diagnosis is a matter of identifying the important factors in
a performance and of measuring a pupil's status with respect to them. For
example, in Civics, the factors of motivation, personal adjustment,
intelligence, reading, study habits, vision and hearing, general experience, as well
as his specific instruction and performance under instruction, may be the
principal dimensions or variables that determine a pupil's status in this
subject. To diagnose why a pupil is failing in Civics requires measurement of his
status with respect to each of these factors and then a careful analysis of the
measurements.

Instrumented diagnostic procedures a teacher may use are available for
intelligence, reading, and study habits only.* There are certain clinical tests
for motivation and adjustment but these require special skill for
administration and interpretation. Through observation and questioning some
rudimentary appraisal may be made of the pupil’s motivation and adjustment and his
study habits as well. His cumulative folder and/or a questionnaire addressed
to the pupil or his parents may disclose something of his experience. The
nurse’s records may be consulted for vision and hearing data. An inspection
of lesson plans should be informative about what and how the pupil is being
taught. Finally, a performance profile in Civics can be drawn through
observation of the pupil as he reads Civics material, through appraisal of his
compositions and notebooks, through listening to his recitation, and through
attention to the types of test items he gets and misses. You are getting at his
profile if you are recording phrases like this: “looks at pictures and reads
captions carefully, skims the rest”; “mentions specific events and conditions
only in composition”; “misses ‘thought’ questions completely.” You are not
drawing a profile if you are just saying “attends to his reading,” “does poorly
in his notebook,” “fails most tests.”

These data (however gross the measures in some cases) enable the teacher
to interpret and perhaps properly handle the pupil’s Civics situation, not 
simply judge it. Obviously, making the measurements is laborious and inter- 
preting them is difficult. 


EVALUATING ACHIEVEMENT IN U.S. HISTORY 
IN THE EIGHTH GRADE, A PROTOTYPE STUDY 

In previous sections of this chapter the what and how of measurement 
and evaluation in the Social Studies is discussed in general terms and, in other 
chapters, general processes and principles of measurement and evaluation are
presented in detail. To facilitate application of these generalities to a particular 
instructional situation, we shall describe now how a teacher might engage in 
measurement and evaluation in a particular Social Studies class. Since it is to 
serve an illustrative purpose, the evaluative activities to be described are more 
extensive than is customary and are somewhat idealized. 

Our prototype will be a class at the eighth-grade level because there we
are likely to encounter most of the issues of measurement and evaluation that
obtain for the Social Studies generally. Let us suppose that the class has
thirty-five members, eighteen boys and seventeen girls, and is in a junior high
school in a large suburban school district. The school’s report card is in terms
of A (excellent), B (good), etc., with only one mark permitted per subject but
with some space for comments. The class has an age range of 12-6 to 14-8, a
range of IQ’s from 79 to 133 with a median of 104, and reading levels from the
fifth grade to twelfth grade.

* An annotated list of tests in these and other areas is in Appendix B.

Course Objectives. In the first semester of the school year, attention is
to be given to colonial development, the Revolution, and the Constitution, or
roughly the period 1607-1789. The stated goals of the course of study are to
improve:

a. The pupils’ understanding of events and conditions on the North
American and to some extent the South American continent from 1607 to 1789,
together with their significance for current affairs.

b. The pupils’ attitudes toward government processes, minority groups,
and school life.

c. The pupils’ study habits.

Evaluative Dimensions and Standards 

According to our protocol for evaluation (see Chapter 9, pages 197-198),
the teacher’s first step is to establish what is to be measured (the dimensions
of achievement) and by what standards they are to be judged. The objectives
of the course, since they are stated in behavioral terms, may constitute the
three basic dimensions. In addition, each of these has several subdimensions
which will be the actual focus of measurement.

1. Knowledge-understanding of the period 1607-1789. For this, identity,
number, organization, and accuracy of ideas are the primary dimensions. The
teacher uses as a standard a performance scale for knowledge-understanding
much like the one described earlier, page 204. This consists of five levels of
understanding, ranging from that which is unacceptable to that which is
exemplary, together with the performances which distinguish each level.


Level                     Performance

1. Unsatisfactory         Negligible remembrance of anything read or presented.
                          Inaccuracy is at a high level. 50 per cent or more of
                          pupil's ideas are erroneous.

2. Barely satisfactory    Nearly all statements and answers depend on rote
                          memory of specific things presented and are accurate
                          only for simple ideas: Columbus discovered America,
                          Washington was the first president, etc.

3. Satisfactory           In addition to remembering more facts more accurately,
                          the pupil can make some comparisons, point to some
                          relationships, and make some discriminations among
                          the events of U.S. History. He is inaccurate frequently
                          but usually is aware of it.

4.                        In addition to remembering, comparing, etc., more
                          accurately and more fully than in Level 3, he can explain
                          and interpret with fair accuracy any major series of
                          events or era studied. He can group sequential items
                          properly and express a few appropriate generalizations.

5.                        Excels at levels 2, 3, and 4, and shows maximum
                          accuracy for eighth graders. He can do some evaluating:
                          for example, that too many political parties create
                          confusion, and can combine facts and ideas taken from
                          several different sources into a coherent exposition.

2. The pupils' attitudes. The tcachci notes three dimensions for his sec- 
ond item of achievement, “the pupils’ attitudes to vard selected current civic 
and social matters.” 

1. The number and identity of items toward which the pupils have feel- 
ings. 

2. The valence (like or dislike, for or against, etc.) of the feeling for 
each item. 

3. The intensity of these feelings. 

The question of a standard for attitudes was restdved more simply than 
was the standard for understanding On the basis of his experience, the teacher 
decided what were g(K)d attributes for eighth graders. Moreover, he assumed 
his class was a ‘'normal” one and hence a high position in the \'lass with 
reference to possc.ssion of the good attitudes indicated a high value for the 
pupil’s attitudes and low position, low value. One standard is to be the pupil's 
status at the beginning of the hrst semester of the eighth grade and the other 
is the range of attitude scores for all his pupils at the end of the first semester, 
all with respect to possession of prescribed attitudes Value becomes then a 
function of change from initial status and of rank among peers. 

3. Pupils' study habits. While seemingly a less imposing factor than the 
others, evaluating pupils’ study habits involves appraisal of a large number of 
dimensions, study being a complex of behaviors, thoughts, and feelings. The 
teacher of our prototype class enumerated the following dimensions for study 
habits: 


4. Good 


5. Excellent 


Reading Skills

1. Identity and number of ideas remembered after reading them. 

2. Rate of such reading. 

3. Differentials for 1 and 2 according to the following purposes of reading: to 
remember, to skim, to search. 


Study Skills 

4. Identity and number of study and nonstudy actions in given period. 

5. Components and patterns of searches for materials. 





6. Amount and kind of notes taken. 

7. Rate of solution to search problems. 

Motivation 

8. Attitudes on these items: assigned reading, assigned writing, “easy” assign- 
ments, “hard” assignments, creative assignments, and new ideas. 

9. Attention span: how long (continuous) on a specific idea or task. 

10. Study span: how long (discontinuous) on a general idea or task.

11. Identity of self-view as a pupil. 

As for standards for study habits, he felt that adequacy here or the lack
of it is directly related to achievement and, hence, needs no independent
evaluation. He is satisfied to determine the status of the pupils and inform
them of it.

Forms and Procedures 

As a second step in evaluating the achievement of thirty-five pupils in an
eighth-grade Social Studies class, the teacher plans the forms of measurement
and the procedures to be used. These are selected on the basis first of
appropriateness to the dimensions and standards and then of relative validity,
reliability, and efficiency. The procedures adopted are displayed in Table 10
along with the dimensions and standards.

Performing the Measurement and Evaluation 

Now, after all this decision and planning, the teacher is ready to do some- 
thing that actually constitutes measurement. During the first three days of 
the semester, as he and his new pupils discuss objectives, assignments, class 
rules, and such, he explains what he intends to do in the way of evaluating their 
progress. Among the points he makes are these:

a. You will be graded on what you do in U.S. History, plus what you seem to
learn about being a good citizen.

b. To determine what you are learning and not learning, I will use tests, look at
your papers, and observe you. Your job in this class has many parts, so I’ll
have to judge many things about you.

c. No one who works as hard as he can should be afraid of failing.

d. We’ll have a period once a week when study will be the order of the day. In
these periods each of you will have a chance to talk to me about your work.

INITIAL TESTS OF READING AND ATTITUDES

In the first week, he administers two tests to the pupils. One is a
standardized reading test that measures rate and comprehension, has subtests
for skimming, searching, etc., and yields scores in terms of reading grade
levels.* This is a part of his effort to appraise their study habits. The other
test hardly looks like a test. It is his means of getting each pupil’s beginning
attitude toward government, minority groups, and school life. Table 11
contains the instrument in abridged form.

* Published reading tests are listed and discussed in Appendix B, pages 486-489.

TABLE 10

An Illustrative Plan of Measurement and Evaluation
for Eighth Grade Social Studies

[Table not reproduced. Legible entries include: Classification and ranking;
Guided response; Reading skills: (1) Accuracy and number of remembrances;
(9) Study span: time in minutes, pupil log.]

When the reading tests are scored, he fills in the profile on the front page
and files each pupil’s test in his folder. He now knows approximately what to 
expect from each pupil in the way of reading and is able to identify those 
who need special help in reading. The attitude tests are scored according to 
the forms of measurement he has planned and then put in the pupils’ files. 
Since he intends to use the test again at the end of the semester, he neither 
passes the papers back nor tells the pupils their scores. 

FREE RESPONSE TEST ON A HISTORY UNIT 

The next effort at measurement is the use of a free response test at the
end of the first U.S. History unit, “A New World Comes into the Old.” The
teacher puts the following questions on the board and gives the pupils
forty-five minutes to write their answers.

1. Name the person, persons, or group who originally settled each of the English
colonies. Name the colony and after it put those who settled it first.

2. Compare the explorations of the Spanish and the English in the two worlds.

3. Explain the effect of the Crusades on European peoples.

4. Explain why England succeeded in colonizing North America and why the
French failed.

The teacher marks the papers by using an analysis schedule of the sort we 
described in Chapter 5, pages 77-81. The schedule is no more than direc- 
tions to himself to look for certain things and do certain things to each paper 
so that he may appraise each pupil’s understanding fairly.

a. Count the separate and relevant ideas in the answer to each question. (You
will notice that each question is keyed to a different level of understanding.)

b. Check for accuracy of statements, write the number of inaccuracies for each
question, and express it as a per cent of the number of relevant statements in
that question.

c. The pupil’s total score is the sum, over all questions, of the number of
relevant statements in each question times the number of the question
(representing levels of understanding).
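The arithmetic of this schedule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and the sample counts are hypothetical, and the counts themselves still come from the teacher's reading of each answer.

```python
def score_free_response(relevant_counts, error_counts):
    """Score one paper under the analysis schedule.

    relevant_counts[i] and error_counts[i] belong to question i + 1.
    The question number doubles as the level-of-understanding weight.
    """
    if len(relevant_counts) != len(error_counts):
        raise ValueError("need one error count per question")
    per_question = []
    for number, (relevant, errors) in enumerate(
        zip(relevant_counts, error_counts), start=1
    ):
        points = relevant * number  # step c: statements times question number
        # step b: inaccuracies as a per cent of relevant statements
        error_pct = 100.0 * errors / relevant if relevant else 0.0
        per_question.append(
            {"question": number, "points": points, "error_pct": error_pct}
        )
    total = sum(row["points"] for row in per_question)
    return total, per_question


# Hypothetical paper: relevant ideas and inaccuracies for questions 1 to 4.
total, detail = score_free_response([5, 4, 3, 3], [1, 0, 0, 0])
print("total score:", total)
for row in detail:
    print(row)
```

The total and the per-question error percentages are kept separate because the teacher, as described in the text, weighs them together against his standard rather than folding them into one number.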

In assigning a letter mark to a paper that has been scored, he looks at the
total score, the error percentages, and the score for each question, and com- 
pares what these seem to indicate with his standard. He gives a similar test at 
the end of each unit and these, plus the marks on papers and notebook, largely 
determine the pupil’s grades. A “sample” pupil’s paper as he marked it is 
shown in Table 12. 



TABLE 11

Test of Attitudes Toward Government,
Minority Groups, and School Life *

HOW DO YOU FEEL ABOUT THINGS?

Name ____________________    Date ____________

Directions: Things are listed here that some boys and girls like and others
dislike. I want to see how you feel. Show whether you like or dislike each item
and how much by putting a check in the proper column. Be just as sincere in your
answers as you can.

                                      Like            Indif-           Dislike
                                      Strongly  Like  ferent  Dislike  Strongly

To talk in class
Football games
Secret ballots in elections
To have Negro friends
Civil Service workers
Academic subjects
All members of my crowd to have
  the same religion
To get special favors from a teacher
To write compositions
Studying
To vote in elections
To go to school
For someone to tell me what to do
To be with Jews
To embarrass the teacher
Politicians
To make a speech
Representative government
Activity subjects
People who tease strangers
To compromise

* This instrument has not been validated and is not for use. It is presented as
an illustration only.




TABLE 12

(5-1)  1. Maryland - Lord Baltimore
          Rhode Island - Peter Minuit  X
          Pennsylvania - The Quakers

(4)    2. Well the Spanish *went to South America and Central America and Mexico
          and the English *came to North America. Also the English didn’t look
          for *gold as much as the Spanish, and the Spanish *were earlier.

(3)    3. The Crusades were *battles between the Crusaders and the Moslems over
          the Holy Land. The Crusaders learned to like *spices and other things. I
          think that they brought *back some books and people from Europe learned
          about the rest of the world.

(3)    4. It was because *the English won the French and Indian War - because
          there *were more English people - because the English *government was
          more interested in colonizing.

          5 x 1 =  5   (20%)
          4 x 2 =  8
          3 x 3 =  9
          3 x 4 = 12
                  34


OBSERVING STUDY HABITS

Shortly after this first unit test the teacher begins to appraise the pupils’
study skills and habits. This is an activity that is to spread over the entire
semester since it involves two observations of each pupil. In recording his
observations he uses the schedule shown in Table 13. This instrument serves
as a reminder of what dimensions to appraise as well as a record form. One
sheet is filled out for each pupil in the first half and again in the last half of
the semester. These are filed in pupils’ folders and become the basis for class
discussions and individual conferences.

Beginning at this time and continuing for the semester the teacher also
rates the pupils’ attitudes toward certain study matters. He does the rating
on the observation schedule (item 5) at the time of the observation but bases
his judgment on things other than what he sees just then.

In addition to attention span, which he measures by observation, the
teacher needs to appraise what he calls “study span.” To do this he asks each
pupil to keep a log on each of two assignments and turn it in with the
assignments. From the logs (see Table 14) he is able to extract an average length
of study period and interval between periods for each pupil and, as well, see
something of the quantity and variety of the pupil’s study activities. Where it
is indicated that a pupil is studying in an inefficient fashion, he discusses the
findings with the pupil and suggests changes. These logs go into the pupil’s files
with all the other measurement data.

TABLE 13

Pupil’s name ____________________    Inclusive time of observation ____________

Place ____________________    Date ____________

1. List study activities.

2. List nonstudy activities.

3. Describe in sequence the pupil’s actions when he searches for a material or
piece of information.

4. Attention span. Pick a ten-minute period when pupil has started reading or
writing and is likely to continue. Time his attentive actions and his
nonattentive ones and draw a graph to show how he spends his time.

   Attends to task
   Minutes                1   2   3   4   5   6   7   8   9   10
   Nonattentive to task

5. Attitude ratings

                          Likes                               Dislikes
                          Strongly  Likes  Indifferent  Dislikes  Strongly

   Assigned reading
   Assigned writing
   Easy assignments
   Hard assignments
   Creative assignments
   New ideas and ways
   Old ideas and ways

TABLE 14

Pupil Study Log

Name ____________________    Date ____________

Subject or assignment ____________________

Date: Started ____________  Finished ____________  Due ____________

Log of activities

Date          No. of Hours and/or Minutes          Activity
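The summary the teacher extracts from a study log can be sketched as follows. This is a minimal illustration, assuming each log line is recorded as a date, a number of minutes, and an activity, and that the interval between periods is counted in whole days. The function name and the sample entries are hypothetical.

```python
from datetime import date
from statistics import mean


def summarize_log(entries):
    """Summarize a pupil study log.

    entries: list of (date, minutes, activity) tuples, one per log line.
    """
    entries = sorted(entries, key=lambda entry: entry[0])
    minutes = [m for _, m, _ in entries]
    # Days elapsed between consecutive study periods.
    gaps = [(later[0] - earlier[0]).days
            for earlier, later in zip(entries, entries[1:])]
    return {
        "periods": len(entries),
        "avg_minutes": mean(minutes) if minutes else 0,
        "avg_gap_days": mean(gaps) if gaps else None,
        "activities": sorted({activity for _, _, activity in entries}),
    }


# Hypothetical log for one assignment.
log = [
    (date(1957, 10, 1), 30, "reading text"),
    (date(1957, 10, 3), 45, "taking notes"),
    (date(1957, 10, 4), 15, "reading text"),
]
summary = summarize_log(log)
print(summary)
```

A short average period with long gaps between periods, or a log with a single kind of activity, is the sort of pattern the teacher would raise with the pupil.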




Besides the logs, the teacher has the pupils describe themselves as
students. About the middle of the semester he asks them to write as much or as
little as they think is appropriate in answer to the question: “What kind of
pupil am I?” When the papers are turned in, he analyzes them by identifying
and counting statements and phrases that clearly indicate each of the
following:

Self-confidence    or    Lack of self-confidence
High aspirations   or    Low aspirations
Pride in work      or    No pride in work
Persistence        or    Gives up easily


While this is not a crystal ball, an analysis of a pupil’s view of himself as 
a pupil may disclose factors that are causing failure or maladjustment in 
school. Where needed, the teacher discusses what he finds with the pupils as
a group or individually.

His final measurement of study habits is made by inspecting the pupil’s 
reading and discussion notes. If a pupil makes no notes, this fact is recorded. 
If he has kept them, the teacher looks at a three-page sample at the middle 
and again at the end of the semester. He uses the following self-directions in
analyzing them.

a. Classify each sentence or separate phrase as a quotation or a paraphrase, as 
expository or suggestive. 

b. Count the total in each category. 

c. Count the words in the three-page sample, omitting prepositions, articles, 
pronouns, interjections, and conjunctions. 

d. Ascertain from each pupil the number of pages of reading and/or minutes of
discussion the three pages represent.
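The counting in steps b and c lends itself to a short sketch. The classification in step a remains the teacher's judgment; the code below only tallies labels he has already assigned. The word list is a small hypothetical sample of prepositions, articles, pronouns, interjections, and conjunctions, not a complete one, and all names are illustrative.

```python
from collections import Counter

# Small illustrative sample of words to omit in step c (assumed, not exhaustive).
FUNCTION_WORDS = {
    "a", "an", "the",                                   # articles
    "of", "in", "on", "to", "at", "by", "for", "with",  # prepositions
    "he", "she", "it", "they", "we", "you", "i",        # pronouns
    "and", "but", "or", "nor", "so", "yet",             # conjunctions
    "oh", "well",                                       # interjections
}


def count_content_words(text):
    """Step c: count words, omitting those in FUNCTION_WORDS."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    return sum(1 for w in words if w and w not in FUNCTION_WORDS)


def tally_notes(labels):
    """Step b: total each category from (source, mode) labels.

    source is "quotation" or "paraphrase"; mode is "expository" or "suggestive".
    """
    by_source = Counter(source for source, _ in labels)
    by_mode = Counter(mode for _, mode in labels)
    return by_source, by_mode


# Hypothetical labels for three sentences from a pupil's notes.
labels = [
    ("quotation", "expository"),
    ("paraphrase", "expository"),
    ("paraphrase", "suggestive"),
]
by_source, by_mode = tally_notes(labels)
n_words = count_content_words(
    "The colonists came to the New World, and they stayed."
)
print(by_source, by_mode, n_words)
```

Dividing the content-word count by the pages of reading reported in step d gives a rough density of notes per page, which is one way to compare pupils whose samples cover different amounts of material.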

No total score is derived, but from such an analysis he is able to make 
a reasonable judgment about the value of note-taking to the pupils’ learning. 
If the judgment is adverse, and it often is for eighth graders, he discusses 
“what to do” with the pupil. The notes are returned to the pupils but an 
analysis slip goes into each pupil's file. 

MARKING COMPOSITIONS 

In the course of the semester, the pupils are required to write two
compositions on topics pertinent to the subjects they are studying and these are
used as a further means of measuring their understanding of history. The
compositions are analyzed in much the same way as are the unit tests.

a. Assertions are classified as to the level of understanding they appear to
represent and the number at each level is recorded. Each frequency is multiplied
by the number of its level.

b. Assertions are checked for accuracy and the number of errors is recorded as a
percentage of the number of assertions.




The pupil’s score is the total of points with the inaccuracy index as a 
companion score. This score is used, as are the unit test scores, for determina- 
tion of group statistics. The now analyzed composition is compared with the 
standard for understanding and the composition is valued accordingly. 

MARKING NOTEBOOKS 

The third and last approach to evaluating knowledge of history is through
an appraisal of the pupils’ notebooks. These are compilations of quotes and
paraphrases from the text, newspaper clippings, map work, etc. They are
distinct from the rest of the pupils’ work in that everything in the notebook is
at the behest of the individual pupil, not the teacher. The same analysis and
evaluation is applied to them as to the assigned papers plus these additions:

a. Pages are classified as primarily copied or self-devised. Self-devised pages
are considered further items at level four or five of the performance scale for
understanding.

b. The different sources or types of source (text, encyclopedia, newspaper,
etc.) are listed and the frequency of each source is noted.

c. Each topic or subdivision is rated as to thoroughness of coverage on a
five-point scale.

Measurements b and c are entered on the reading notes analysis slip as 
further evidence of a pupil's study skills. 

FINAL MEASUREMENT OF READING AND ATTITUDES

The final acts of measurement for the prototype class are a readministration
in the last week of the semester of the two tests given in the first week, one
for reading and the other for attitudes. These are scored as were their earlier
counterparts and compared with the earlier scores. Appropriate notations are
then made in the pupils’ files. In the case of the attitude instrument, the
teacher adds these scores and score differences to the same for previous
classes and makes a frequency distribution of the final scores and also of the
differences between the first and second scores. Pupils are assigned percentile
ranks for status and for change on the basis of these two distributions and
their attitudes are evaluated according to the quartile in which they fall.
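The two rankings described here, one for final status and one for change, can be sketched as below. The text does not say how percentile ranks are computed, so this sketch assumes one common convention: the per cent of scores below a given score plus half the per cent equal to it. The function names and sample scores are hypothetical.

```python
def percentile_rank(score, distribution):
    """Per cent of scores below `score`, plus half of those equal to it.

    This midpoint convention is an assumption; the text names no formula.
    """
    below = sum(1 for s in distribution if s < score)
    equal = sum(1 for s in distribution if s == score)
    return 100.0 * (below + 0.5 * equal) / len(distribution)


def quartile(rank):
    """Map a percentile rank (0 to 100) to a quartile, 1 (lowest) to 4."""
    return min(4, int(rank // 25) + 1)


# Hypothetical first-week and last-week attitude scores for four pupils.
initial = [12, 18, 25, 30]
final = [10, 20, 30, 40]
change = [f - i for i, f in zip(initial, final)]

status_rank = percentile_rank(final[2], final)    # third pupil, final status
change_rank = percentile_rank(change[2], change)  # third pupil, change
print(status_rank, quartile(status_rank))
print(change_rank, quartile(change_rank))
```

In the procedure the text describes, the distributions would also include scores carried over from previous classes, which only lengthens the lists passed to `percentile_rank`.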

REPORT CARDS 

With measuring and piecemeal evaluating at an end, it remains for the
teacher to make out report cards (the summary evaluation). As he works he
has constant recourse to charts showing test distributions, to the statements of
standards, and particularly to the pupils’ files. In general, the periodic
evaluations of knowledge-understanding are the primary basis for any pupil’s
marks, A, B, C, D, and F corresponding roughly to the five performance levels of
the standard for knowledge-understanding. Evaluations of attitudes serve to
raise or lower marks about which he is uncertain. More exact information as 
to them as well as to study habits is given to pupils separately, in the form 
of a note or by a conference. 





Comments on the Prototype Study 

Through a description of measuring and evaluating as a teacher might 
perform it for a class in Social Studies, we have tried to provide an “opera- 
tional” experience in measuring and evaluating. You may have noticed that 
the phenomena and dimensions the teacher measured were not just the same 
as those we earlier attributed to the Social Studies. Such difference from the 
average is likely to be true of any Social Studies class and, consequently, 
measuring and evaluating in any class will be a unique problem. However, our 
“teacher’s” approach and the substance of his procedures are thought to be 
generally applicable. 

While the prototype teacher’s performance was intended to be within the
scope of any well-educated and energetic teacher, he did engage in more
extensive evaluation than most teachers do. Moreover, there is some
oversimplification in the example, and by no means are all the problems of
measurement and evaluation in eighth-grade Social Studies examined. Pupil and
parent pressure for grades, “much effort no gain” cases, the “70 per cent is
passing” tradition, and the matter of the “normal curve” are among the
omitted problems.

Summary 

Measurement in Social Studies typically is directed toward pupils’
“knowledge” of given subjects (or of the content of units), to their map and
chart skills, to attitudes that relate to citizenship, and to their problem
solving ability. The measurable dimensions of these phenomena so far as practice
is concerned are largely the identity, number, and pattern of elements, e.g.,
the facts and concepts a pupil remembers, his map vocabulary, and the objects of
his strong feelings. Description-classification and ranking are the principal
forms of measurement. All measuring procedures are applied to Social Studies
instruction with various observation and free-response techniques being the
more prevalent.

There are no generally accepted evaluative standards for the Social
Studies. In use are statements of objectives in courses of study, the content
of textbooks, and the teachers’ ideals, all of which have serious shortcomings.
In many cases, conscious attention is never given to a standard. Marks in
Social Studies subjects are usually the letters A, B, C, D, or F. These
frequently are replaced by descriptive statements and even parent-teacher
conferences in the primary grades of many schools. In the Social Studies, perhaps
more than elsewhere in the curriculum, citizenship is evaluated and
“psychological” grading is practiced.

In the elementary grade Social Studies “unit,” evaluation involves a number
of special problems. Progress often is reckoned relative to “pupil-planned”
objectives. Work is diversified and thus pupils’ efforts frequently are not
comparable. Pupils as a group may evaluate themselves by discussion. Since
there is little standard “content,” tests of knowledge usually are inappropriate.





Little attention is given to analyzing disabilities in the Social Studies.
However, observation and product or free-response analysis can be valuable
sources of diagnostic information.

In the prototype study of measurement and evaluation in an eighth-grade
situation, the following were salient features. The teacher planned his
program, standards, dimensions, forms, and procedures before he undertook it.
During the semester he engaged in some sixteen measurement activities. Many
of these involved individual observations of each pupil and analysis of
products. The results were collected in a folder for each pupil and were
examined by the teacher at the end of the semester as a basis for his mark.


EXERCISES

1. Indicate several measurable dimensions for achievement in one or more of
these subjects: U.S. History, World History, Geography, Contemporary Affairs
and Problems, American Government.

2. Write a criticism of the evaluative activities employed by the teacher in the
prototype study.

3. Prepare a free response test for understanding of current affairs and
problems. Include a scoring plan of the factor rating or counting type for the
answers to each question.

4. Devise a guided response test for map reading and interpretation.

5. Construct a vocabulary test for terms important in the Social Studies.

6. Prepare a standard for evaluating understanding of a specific topic in the
Social Studies that contains at least four levels of performance. Devise at
least one test item appropriate to measuring performance at each level.




CHAPTER 12 


SCIENCE AND MATHEMATICS 


If the essential characteristics and the primary contributions of our
Western culture were set forth, the activities and ideas embodied in science and
mathematics would undoubtedly be high on the list. Our achievements in
science and mathematics have helped to form our highly complex and
technological society and have enabled us to control our environment for our
purposes. Today, we are made even more aware of the essential role of scientists
and mathematicians in America as we read about their shortage in industry
and government and our failure to train as many of them as other countries.
Therefore, it is not surprising to find that the study of science and
mathematics occupies an important place in our common schools.

The two subjects are sufficiently different in content and approach to
warrant separate treatment. The fundamental difference between mathematics
and science lies in their approach. Science is a method of inquiry that is
primarily inductive, that is, reasoning from particular events to generalizations
or laws. Mathematics, on the other hand, is primarily a method of deductive
inquiry, that is, reasoning from general statements or axioms to specific
statements.

The relationship between science and mathematics has not always been
clear, either historically or educationally. There are those who maintain that
mathematics should be subservient to science, and as evidence for their stand
they point to the evolvement of mathematics from man’s social, physical, and
religious needs. This point of view has certain educational implications. In
schools where it prevails, mathematics would not be taught separately but
would be included as a part of other subjects when needed; or at best taught
in such courses as business mathematics, shop mathematics, or consumer
mathematics.

It is true that mathematics evolved from man’s everyday activities and 
has been and still is an important tool for science. In the nineteenth century, 
however, the deductive nature of mathematics was rediscovered from the 
Greeks and since that time mathematics has been making contributions in its 
own right. A rather spectacular example of the essential difference between 
mathematics and science, and of the importance of mathematics per se, is 
provided by the geometry of Euclid. This was developed some 2,200 years
ago but is still being studied and used today whereas most of the scientific
theories of Euclid’s time are now known to be erroneous. A deductive scheme
(mathematics) may stand fairly intact for a considerable length of time while 
an inductive scheme (science) is constantly undergoing change. The title of 
Eric Temple Bell's book, Mathematics, Queen and Servant of Science, is 
suggestive of the present outlook on the relationship between mathematics and 
science. Mathematics provides the rational organization for science and hence 
is the Queen of Science. At the same time, in the applications of science, 
mathematics is the tool and hence the Servant of Science. 

In this chapter, evaluative procedures in science and mathematics are
discussed separately. These separate discussions, however, are combined into
one chapter because of the common problems of evaluation involved. These
common problems stem from certain similarities of content. Both subjects
require special terminology and precision of statement. Both subjects are highly
systematized and the ideas involved are closely related and interdependent.
Moreover, mathematics and science are mutually involved in many aspects of
knowledge and human activity.

For each subject we first shall outline briefly its content throughout the
grades and identify some of the more important objectives and dimensions of
pupil achievement. Following this, we shall propose some evaluative standards
and describe several measuring procedures appropriate both to them and to
the dimensions of the subject. Finally, we shall indicate any special problems
or considerations involved in the measurement and evaluation of either
subject.


SCIENCE 

Science has two major aspects. First, it is a body of systematized
knowledge that is useful and practical. Secondly, it is a process or approach
known as the “scientific method.” These phases now are assuming equal importance
in the schools, whereas in the past the first had been overstressed at the
expense of the second. Certain scientific facts, concepts, and principles should,
of course, be a part of basic education so that everyone may live more
effectively in his natural environment. At the same time, it is equally essential
for children and youth to learn the method or approach science uses in solving
problems, acquiring information, and developing laws.

All scientific activity centers around the problem of comprehending and
controlling man’s environment. Very often the products and end results of
scientific activity overshadow everything else concerning science. To many
persons, science consists primarily of all the interesting gadgets and machines,
new treatments and processes, new materials, and miracle drugs, all of which
are given extensive publicity. The real essence of science, though, is found in
the activity that produced these spectacular results, and this scientific activity
is characterized by the following:




1. Its effort toward accuracy and precision in measurement and description.

2. Discrimination, leading to analysis and classification. 

3. Making comparisons and establishing relations that lead to the formula- 
tion of underlying principles. 

4. Testing hypotheses experimentally and analytically. 

5. Developing systems of related ideas using mathematical systems as 
guides. 

So far, we have been discussing science rather than the sciences. We are
all aware of the various subdivisions of science taught in school: physics,
botany, astronomy, geology, chemistry, and biology. For our purposes,
however, we shall concern ourselves only with the two broad classifications of
science, namely, the physical sciences and the biological sciences. In the
elementary school no special distinction is made among the various sciences and
it is not until the secondary level that the branches of science are treated as
subjects in themselves.

Science instruction in the modern elementary school centers about the
firsthand experiences and the immediate environment of the children. Topics
include such items as the changing weather and seasons, clothing, various
plants, animals, and other living things, fire, magnets, health, various materials
and their uses, and food. In earlier times, the science program in the
elementary school was concerned mostly with identifying and classifying things
and with observation. Use of the experimental approach and problem solving
methods was thought to be too advanced for elementary school children.
Today, however, the elementary school science program has been extended
to encourage the children to relate and compare their observations and to use
the scientific approach in studying materials at their level. We find that
grade-school children can carry out simple experiments and proceed
systematically to solve problems that are significant to them.

At the secondary school level, ordinarily a biology course or a general
science course is required of all pupils. From there on, some specialized
courses in science are offered that may be taken as electives or are required
for a college preparatory program. Such courses include chemistry, physics,
physiology, photography, and agricultural science.

Measurable Dimensions in Science 

Formulating the many dimensions involved in science instruction is a
formidable task for any new teacher. Fortunately, one can turn to the previous
efforts of several committees and groups of persons who have given
considerable thought and time to identifying the objectives of science teaching.¹
The objectives stated by deliberative committees are tantamount in most cases
to the basic measurable dimensions of pupil achievement. Sometimes they
need to be restated in terms of pupil performance. One such group identified
the following types of objectives for science teaching (10):

¹ For a more comprehensive summary of individual and committee efforts in
identifying the objectives of science instruction, the student is referred to
Chapter II of Heiss, Obourn, and Hoffman, “Modern Science Teaching” (7).

A. Functional information or facts about such matters as: our universe, 
living things, the human body. 

B. Functional concepts such as: space is vast, the earth is very old, all 
life has evolved from simpler forms. 

C. Functional understanding of principles such as: all living things re- 
produce their kind, energy can be changed from one form to another. 

D. Instrumental skills such as the ability to:

1. Read science content with understanding and satisfaction.

2. Perform the fundamental (arithmetical) operations with reasonable
accuracy.

3. Perform simple manipulatory activities with science equipment.

4. Make accurate measurements, readings, etc.

E. Problem solving skills such as ability to sense and define a problem,
select a tentative hypothesis after studying facts and clues in the situation, test
the hypothesis experimentally, and accept tentatively or reject the hypothesis
and test another, draw conclusions.

F. Attitudes such as open-mindedness, intellectual honesty, and
suspended judgment.

G. Appreciations of the contributions of scientists.

H. Interests in science as a recreational activity or as a field for a
vocation.

The science teacher may use this or a comparable list as a starting point 
for identifying the dimensions he wishes to measure. Such a list is certainly 
not to be considered final and should be modified to suit given circumstances. 
For instance, at the elementary school level the teacher may wish to consider
an additional factor, the pupil’s awareness and sensitivity to surrounding
natural phenomena. Some youngsters are keen observers and seem never to
miss anything that goes on around them, while others seem nearly oblivious
to their surroundings. This sensitivity and keenness are important aspects in
science. Another aspect a teacher may wish to appraise is ability to describe
accurately and to formulate precise statements about what is seen.

Any original listing of dimensions needs to be reviewed to determine if
any important aspects of achievement are neglected or if there are any dimensions
that could be more sharply delineated. If we compare the above list with
the list of measurable dimensions common to all education phenomena (see
pages 29-30), we find that the dimension of identity and number of components
listed in Chapter II is essentially equivalent to the aspects of facts,
concepts, and principles presented in the list. However, the dimension, organization
of components, mentioned there is not apparent in the list. Consequently,
the teacher may wish to add such an item as ability to organize ideas
and to see their interrelationship. Then, too, the sixth of the listed objectives,
“attitudes such as open-mindedness,” etc., certainly must be rephrased if it is
to be measured.

It may at times be desirable to devise dimensions from a specific point of
view. A good example of this is provided by a group (2) who organized their
list of dimensions around the reading of scientific material, this being merely
a subheading in the set of dimensions just presented. They set up a list based
on any topical reading material in science. Portions of this list are provided
below:

1. “Problems. Ask a question which requires the student:

(a) to identify the problems to which the statement gives the answer and
to recognize the central problem to which a number of statements are addressed;

2. “Information (data, laws, principles). Ask a question which requires
the student:

(a) to recognize when the information he possesses is inadequate for a
given problem;

(b) to indicate kinds of sources of information appropriate for a given
problem;

3. “Hypotheses. Ask a question which requires the student:

(a) to formulate or recognize hypotheses based on given data or situations;

4. “Conclusions. Ask a question which requires the student:

(a) to recognize the generalization(s) involved in an interpretation or
conclusion;

(b) to detect the unstated assumptions involved in a conclusion;

5. “Attitudes. Ask a question which requires a student:

(a) to recognize in a paragraph or statement proper or improper use of
such concepts as causality, teleology, simplicity, consistency, tentative nature
of truth, and operationalism.”

The dimensions just listed overlap a great deal with the dimensions listed 
earlier on page 297. However, it is apparent that organizing the objectives of 
measurement around scientific reading material does place a restriction upon
their scope. 

Ways of Determining Dimensions. From our discussion so far, there
seem to be some four ways for a teacher to develop measurable dimensions, 
aspects, or objectives in science instruction: 

1. Use an already prepared list and make adaptations where necessary for
local needs.

2. Identify the items as they currently and persistently arise in science 
instruction. 

3. Compare various proposed lists of dimensions and check for discrepancies.
From these lists develop a composite list and check for improvement
in delineating the dimensions.

4. If special circumstances require it, take a proposed list or composite
list of dimensions and reorganize it from a different point of view.

In devising, selecting, or adapting the aspects of achievement in science 
which are to be appraised, it is essential to keep in mind the conditions which
dimensions must approximate if they are to be measurable. The requirements 
for measurability are discussed in detail in Chapter 2, pages 19-26. Fur- 
thermore, what is to be measured should be known by pupils and hence must 
be stated so that they can understand it. 

Evaluative Standards in Science 

Once the dimensions of achievement have been determined, the next step 
in the evaluation process is to formulate appropriate evaluative standards. 
As was the case with Social Studies, there seem to be no well defined and 
generally agreed-upon standards available for use by the science teacher. In 
many cases the objectives of instruction are themselves the standards. For 
example, in a chemistry class, the teacher may want all his pupils to learn 
the names and atomic numbers of the basic chemical elements. To recite all 
these elements and their atomic numbers correctly constitutes the highest 
value; to recite fewer than all of them correctly has less value. In other cases,
the mean performance of a class is taken as the basic point of reference for
evaluation. Achievement that exceeds the average is marked as meritorious
and that which falls short of the average is given a negative rating. Finally,
as in many school subjects, evaluation too often is performed without reference
to any known standard. The teacher simply feels that the student should have
a high grade or a low grade and that’s all there is to it.
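
As a rough illustration of the second practice described above, in which the class mean serves as the point of reference, the following Python sketch rates each pupil against the average. The pupil names, the scores, and the three rating labels are hypothetical; they are assumptions made for the example and are not drawn from any actual class.

```python
from statistics import mean


def rate_against_class_mean(scores):
    """Rate each pupil relative to the class mean.

    `scores` maps a pupil's name to a numerical score. The labels
    returned are illustrative assumptions, not a prescribed marking system.
    """
    class_mean = mean(scores.values())
    ratings = {}
    for pupil, score in scores.items():
        if score > class_mean:
            ratings[pupil] = "above average"
        elif score < class_mean:
            ratings[pupil] = "below average"
        else:
            ratings[pupil] = "at average"
    return class_mean, ratings


# Hypothetical scores: number of chemical elements recited correctly.
scores = {"Pupil A": 18, "Pupil B": 12, "Pupil C": 15}
class_mean, ratings = rate_against_class_mean(scores)
print(class_mean)
print(ratings)
```

A rating of this kind shows only where a pupil stands relative to his classmates; it does not by itself show how much of the objective he has attained.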

Our basic viewpoint about evaluative standards is presented in Chapter 9 
and, as you may recall, we consider a scale or hierarchy of performance to be
the most valid standard for use in evaluating pupil achievement in school
subjects. In science instruction, the teacher will need to develop such a performance
scale for himself since there are no published ones available. A
performance scale is simply a statement of several gradations of behavior
with respect to the aspect of achievement in question, which range from that
which is of no value to that which is considered to be of highest value. Pupils
are judged according to the level their actual performance approximates.

A generalized variation scheme is described on page 2(4, which may be
applied to understanding or knowledge of any subject. Here we shall not
attempt to define performance scales for all dimensions of science. This is
something that each teacher needs to do himself for the aspects of achievement
he wishes to measure. We shall illustrate the process by selecting one
dimension unique to science and by developing a performance standard for it.

The dimension we have chosen for the example is “the ability to formulate
a scientific experiment that will answer a question or test a hypothesis.” This
objective has been selected because it is relevant to both the elementary and
the secondary school levels. The illustrations are drawn from the topic of
magnets, which is often discussed in the upper elementary grades and, of
course, is important in general science and physics at the secondary level.
Ordinarily the children read about what magnets can do and then they are
given the opportunity to do various things with magnets. To gauge their
ability to use an experiment to answer questions about magnets, the teacher
raises this question: How can you tell which of these two magnets is stronger
than the other? The pupils are then required to formulate an experiment that
will answer this question or test the hypothesis that Magnet A is stronger
than Magnet B. In judging their responses the teacher uses the performance
scale shown in Table 15.


TABLE 15

Example of an Evaluative Standard for Science

Objective or dimension on which the pupils are to be evaluated: The ability to formulate
an experiment that will answer a given question or test a given hypothesis.

Question: How would you find out which of these two magnets is the stronger?

Level I

Performance: Suggests an experiment that misses the point; either it doesn't solve anything
or it attacks another question.

Illustration: Pupil suggests that they hammer on the magnets, an experiment which
would measure the strength of material in the magnets but not the strength of their
magnetism.

Pupil suggests seeing how hard it is to pull the two magnets apart.

Level II

Performance: Suggests an experiment that points toward the answer but it is unrealistic
and fantastic in its arrangements, or one that does not control all the variables, so
that there still may be some doubt about the answer.

Illustration: Suggests an elaborate arrangement for measuring distance at which magnets
attract objects, but fails to account for variation in frictional surface and variation
in the material contained in objects.

Level III

Performance: Suggests an experiment that is simple, direct, and realistic, and would give
a clear-cut answer to the question or provide a clear-cut basis for accepting or rejecting
the hypothesis.

Illustration: Suggests having the magnets pick up iron objects of the same shape but of
different weights. The magnet picking up the heavier object would be the stronger.





These performance levels cited are designed as evaluative standards for
elementary pupils. With refinement, though, they may be pertinent to second- 
ary grades as well. 

Forms and Procedures of Measurement in Science 

Forms. As we observed in the first chapter, the form in which measures
are to be expressed is primarily a function of the dimensions appraised and 
of the standard to be applied to its evaluation. In science instruction the bulk 
of the dimensions lend themselves only to classification-description or to 
ranking. The precise units and fixed reference points necessary for scale 
measurement are not possible at present except for a few minor aspects of 
science achievement — rate of performance for one. Since the standards we 
have described as most appropriate for science evaluation are themselves 
classification schemes, the use of classification symbols would seem to be most 
appropriate for measurement. Of course, pupils may be ranked relative to any 
performance level, or, by interpolation, relative to the whole range of the 
performance scale. For example, five pupils judged to be performing at the 
third level of the standard shown in Table 15 could be assigned rank among 
themselves. If the three levels were considered merely three designated points 
on a continuum of performance, thirty pupils might be ranked in terms of
their position on the continuum. 
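
The two forms of measure discussed above, classification by performance level and ranking, can be combined in a simple way. The sketch below, in Python, assumes that the teacher has already judged each pupil's level on the three-level scale of Table 15 and has also ordered the pupils within each level. The pupil names and the judgments are hypothetical.

```python
def overall_ranks(judgments):
    """Convert level classifications into a single ranking.

    `judgments` maps a pupil's name to a pair
    (performance level, rank within that level). A higher level is
    better; within a level, rank 1 is the best performance.
    Returns a dict mapping each pupil to an overall rank (1 = best).
    """
    ordered = sorted(
        judgments,
        key=lambda pupil: (-judgments[pupil][0], judgments[pupil][1]),
    )
    return {pupil: position for position, pupil in enumerate(ordered, start=1)}


# Hypothetical teacher judgments: (level, rank within level).
judgments = {
    "Ann": (3, 2),
    "Ben": (2, 1),
    "Cara": (3, 1),
    "Dev": (1, 1),
}
print(overall_ranks(judgments))
```

The classification itself remains the primary measure here; the ranking merely orders pupils whom the teacher has already placed on the scale.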

Applicable Procedures. The many types of procedure for measuring 
behavioral phenomena — observation, product analysis, free response tests, 
guided response tests — are described in detail in earlier chapters (III-VI). 
Our presentation here is limited to a few applications of these procedures to
scientific subjects, particularly to those test items and observational methods
that can appraise the critical thinking and problem solving aspects of science.
(For a more comprehensive survey of measurement in science, see the 45th
and 46th Yearbooks of the National Society for the Study of Education.)

GUIDED RESPONSE ITEMS

So far as knowledge of facts or remembrance of nomenclature is con- 
cerned, all types of guided response items are appropriate: true-false, multiple- 
choice, matching, arrangement, labeling, short answer, etc. All these are 
illustrated in Figures 9, 10, 11. Two of the most common types of science test 
items are shown in Figure 50. 


An ant is a (an)                 The three types of levers are:

   a. amphibian                     1. ______
   b. insect                        2. ______
   c. mammal                        3. ______
   d. reptile


Figure 50. Two usual types of guided response items used in science. 




Since these kinds of questions are easily constructed, it is understandable 
that questions of a factual type often predominate in science tests. 

During the past twenty years under the leadership of such men as Tyler, 
Raths, Noll, and Zechiel progress has been made in developing procedures 
that show promise of extending the measurement of science beyond the factual
dimension to that of scientific thinking. Among the “new” dimensions it
is now possible to measure are identification of cause and effect relation- 
ships, detection of unstated value judgments, recognition of valid and invalid 
reasons, determination of whether or not statements can be verified, and 
perception of limitations of data. In Table 16 we illustrate testing procedures 
appropriate to these newer aspects of science achievement. 

TABLE 16

Examples of Procedures for Measuring Ability to Think Scientifically

Illustration 1. Identifying valid cause and effect relationships.

Below are listed pairs of events. In the blank space between the two events, write:

A. If the first event is the cause or contributing cause of the second event.

B. If the first event is the effect or result of the second event.

C. If there is no cause and effect relationship.

First Event                                      Second Event

1. Clouds are moving in the sky.        ____     Heavy dust is in the air.
2. A weight is placed on the end
   of a suspended spring.               ____     The spring stretches.
3. Johnny’s eyes water.                 ____     Johnny’s feet sweat.
4. Walking under a ladder.              ____     Getting hit on the head
                                                 with a paint bucket.
5. The sound from a drum.               ____     The beating of a drum.

(NOTE: Cause and effect relationships may be very complex and caution should be
exercised in setting up this type of device.)


Illustration 2. Detecting unstated assumptions necessary to reach certain conclusions.

A certain city reported that during the year 1956 there were 1,216 men drivers involved
in auto accidents while only 127 women drivers were involved in car accidents.
Conclusion: Women are safer drivers than men.

What important assumptions are necessary in order to arrive at this conclusion?

(NOTE: This could be a free-response question, or a list of statements could be provided
and the pupils asked to check the ones they believe to be necessary assumptions.)





Illustration 3. Recognizing valid and invalid reasons.

Smith and Tyler and the Evaluation Staff, in their book Appraising and Recording
Student Progress (15), have made a careful analysis of this dimension. Their procedure
for measuring it consists of describing a situation involving natural phenomena, first
asking the pupils to give their answer to the problem in the situation and then requiring
them to check the valid reasons for their answers from a list of statements. As one
example, they describe a situation in which during warm weather persons not having
refrigerators often wrap their bottles of milk in wet towels and keep them where there
is good circulation of air. The pupils are first asked if this would be effective and then
they are asked to check the reasons for their answer.

In their analysis, they have identified the following categories of wrong reasons that
might be present along with the valid reasons for consideration by the pupils:

1. True reason but irrelevant to the problem

2. Teleological statement indicating some predetermined dts

3. Ridiculous statement

4. Assuming the conclusion

5. Unacceptable analogy

6. Unacceptable authority

7. Unacceptable common practice

8. Superstition

A teacher of science should be able to use this measuring procedure at all levels. A
considerable number of commonplace observations can be explained by pupils on the
basis of principles learned in their science classes. An example of how the device may be
applied to another observation is provided as follows:

Observation: It takes longer to cook food in boiling water at an elevation of 0 OOO
feet than at sea level.

Directions: Below is a list of statements that attempt to explain the above observation.
Check the statements that are valid explanations.

1. The relative humidity is less at high altitude than at sea level.

2. Food weighs less at high altitude.

3. Atmospheric pressure decreases with increase in altitude.

4. People should not try to live in high altitude places.

5. Temperatures are lower at higher altitudes.

6. As the atmospheric pressure decreases, the boiling point of water decreases.

7. Homemaking experts recommend that foods should be boiled longer at
higher altitudes.

Many other items similar to this example could be developed. The approach could
also be used in connection with classroom demonstrations by having pupils check the
valid reasons for what had happened in the demonstration.

Illustration 4. Determining if a statement can be verified scientifically.

Directions: For each of the statements listed below, write:

A. If the statement can be scientifically verified.





B. If the statement is a theory or hypothesis and hence cannot be scientifically verified. 

C. If the statement contains a value judgment and hence cannot be scientifically verified. 

D. If the statement is a definition and hence is not subject to scientific verification. 
1. The eye is an organ of a human body. 

2. The evaporation rate of this pan of water will increase if heated. 

3. Molecular action in a body of water will increase if heated. 

4. The earth travels around the sun. 

5. Mosquitoes should be destroyed because they are disease-bearing.

6. Osmosis is a process by which the molecules of a substance pass through 

cell membranes. 


If students can sort out those statements that can be verified and those statements that
cannot, and tell why, they are on their way to a good understanding of science.

Illustration 5. Ability to perceive relationships in scientific data and to perceive
the limitations of scientific data.

These two abilities have been carefully analyzed by Smith and Tyler and the Evalua- 
tion Staff (15) as aspects of a broader ability, namely, the ability to interpret data. 
They have determined a scheme for classifying students' interpretative statements which 
are indicative of their perceptiveness for relationships and again of their awareness of the 
limitations of data. 

Relationship interpretations are classified as reading points, comparison of points,
cause, effect, value judgment, recognition of trend, comparison of trend, extrapolation,
interpolation, sampling, and purpose.

Classifications for the second ability, perception of limitations, are as follows: accurate,
overgeneralization (goes beyond data), undergeneralization (overly cautious),
and crude error (missed the point entirely). The following exemplifies the measurement
of these abilities in terms of these classifications.

Data: A national survey of accidental deaths was made in the U.S. with the follow- 
ing results reported by types of accidents. 


Types of Accidents        Percentage of Accidental Deaths

Motor vehicle                 36
Railroad                       4
Air transport                  0.5
Falls                         24.5
Burns                          6
Drowning                       7
Firearms                       3
Poisoning                      3
Miscellaneous                 16




Directions: Below are some statements that are interpretations of the above data.
For each statement, write:

T. If the statement is a reasonable interpretation of the above data.
U. If there is insufficient evidence supporting the statement.
F. If the above data contradict the statement.

(Reading points)

____ 1. 36 per cent of the people in the United States have cars.

____ 2. 6 per cent of the fatal accidents involved burns.

(Comparison of reading points)

____ 3. More people ride the train than the airplane.

____ 4. Airplanes are safer than trains.

____ 5. There were more accidental deaths by drowning than by burns.

(Cause)

____ 6. Accidental deaths by drowning are the result of swimming in unsafe places.

____ 7. Fatal accidents from falls are the result of carelessness.

(Effect)

____ 8. If all car drivers were more careful, there would be fewer fatal accidents.

(Value judgment)

____ 9. Possession of firearms should be severely restricted.

____ 10. A campaign should be carried out warning people about fatal accidents
from falling.

Students' answers are compared with the key answers and each answer is judged as
accurate, beyond data, overcautious, or crude error according to the following scheme:

                              Key Answer
Student's answer     T               U               F

T                    Accurate        Beyond data     Crude error
U                    Overcautious    Accurate        Overcautious
F                    Crude error     Beyond data     Accurate

These comparisons would give an indication of the student's ability to recognize the
limitations of data. To measure his perceptiveness of various types of relationship among
data, he could be required to indicate the type of relationship each statement involves.
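
The scoring scheme above amounts to a lookup on the pair (student's answer, key answer). A minimal Python sketch follows. The judgment table is copied from the scheme, while the answer key and the pupil's answers shown are hypothetical and are not the key to the ten statements listed earlier.

```python
from collections import Counter

# Judgment for each (student's answer, key answer) pair, taken from the scheme.
JUDGMENT = {
    ("T", "T"): "accurate",
    ("T", "U"): "beyond data",
    ("T", "F"): "crude error",
    ("U", "T"): "overcautious",
    ("U", "U"): "accurate",
    ("U", "F"): "overcautious",
    ("F", "T"): "crude error",
    ("F", "U"): "beyond data",
    ("F", "F"): "accurate",
}


def judge_answers(student, key):
    """Return one judgment per item for parallel lists of T/U/F answers."""
    if len(student) != len(key):
        raise ValueError("student and key must contain the same number of answers")
    return [JUDGMENT[(s, k)] for s, k in zip(student, key)]


def tally(student, key):
    """Count how many answers fall under each judgment."""
    return Counter(judge_answers(student, key))


# Hypothetical four-item key and one pupil's answers.
key = ["T", "U", "F", "U"]
pupil = ["T", "T", "U", "U"]
print(judge_answers(pupil, key))
print(tally(pupil, key))
```

A tally of this kind lets the teacher see at a glance whether a pupil tends to go beyond the data or to be overly cautious, rather than merely how many answers match the key.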




So far, we have confined our discussion to guided response test items. The 
many possibilities of this type of measuring device have only been suggested 
here and the interested person is encouraged to investigate the other sources 
suggested earlier in this chapter. We shall now turn briefly to another category 
of measuring procedure, observation, and the use of anecdotal records and 
rating procedures in connection with it. 

OBSERVATION 

Direct observation of the pupils as they go about their activities in science
is a particularly valuable procedure in appraising the status of pupils in science 
classes. Earlier in the chapter it was mentioned that awareness of and sensitiv- 
ity to one’s physical and biological environment is an important dimension 
of achievement in science. In informal class discussion it is possible for the 
teacher to note that some pupils are much more observant than others. This
may be made a matter of record by writing short summaries of exactly what 
happened. A second-grade teacher, for example, might record the following 
after a class discussion on plants. 

When the class was asked to tell how some plants were different from others,
Eric was able to list about ten different ways, some of which were rather subtle.
For example, he mentioned that some plants had bright shiny leaves, that some
plants had soft stems, and that some plants had single stems coming from the
ground while others had more than one stem coming up from the ground.

An accumulation of such observations can be the basis for evaluating
different children in regard to their sensitivity to things around them. Recorded
observations are also useful in appraising such factors as scientific attitude
and scientific approach to problems. The teacher would need to record in- 
cidents that indicate Mary’s open-mindedness or Frank’s rigidity in regard to 
new information. He also should record how Bob set about finding out why 
a certain crystal radio kit wouldn’t work. 

Check lists, as mentioned in Chapter 4, page 52, are frequently used in
observation to direct the observer’s attention to various facets of the thing 
being measured. Such general dimensions as scientific attitude and the ex- 
perimental approach to problems have many aspects. Scientific attitude, for 
instance, comprises such items as a willingness to accept the tentative nature 
of explanations, an avoidance of becoming “ego-involved” in the experimental 
process, an unwillingness to distort or to compromise the truth in any way, 
and an unwillingness to jump to conclusions on the basis of insufficient evidence.
These are only a few of the elements of a scientific attitude and their
observation is made more efficient if the teacher has a check list of them. In
addition, use of a rating scale for each aspect (see page 56) may enable
more systematic and comparable appraisal of pupils. 

Projects of various sorts are an essential part of science instruction. These 
include collections of rocks, insects, and plants; the preparation of displays 





showing life cycles or working parts of machines; the construction of models,
radios, electric motors, and other mechanical and electrical devices; and the
reports of field trips.

The teacher might appraise these projects by using a check list and/or
rating scale that contains such factors as originality and creativeness, scientific
approach, colorful display, and clarity. In laboratory courses, the student’s
ability to set up the necessary equipment for experiments is an important
aspect of his achievement. In appraising this ability, consideration should be
given to such aspects as proper choice of equipment, manipulative ability in
setting up the apparatus, neatness, and stability, and the over-all effectiveness
of the apparatus to do the job it was set up to do.

The illustrations just presented should indicate that direct observation
has an important place in measuring and evaluating the competence of pupils
in science classes.

STANDARDIZED TESTS IN SCIENCE

The standard critical reference for published standardized tests is the
Mental Measurements Yearbook edited by Buros (1). The following number
of standardized tests in science are listed in the fourth edition of the Yearbook:


Subject                              Number of Standardized Tests

Elementary Science
General Proficiency in Science        4
Biology                              11
Chemistry
General Science                       7
Geology                               1
Physics                              11


From the above list we can readily see that nearly all the tests are at the secondary
or college level and that practically three fourths of the tests are in
chemistry, biology, and physics.

An inspection of some of the more recent tests in science reveals that an
effort is being made to extend the coverage of these tests to include other than
factual objectives. For instance, in the Manual of Directions for a biology
test the following statement is made (Appendix B, page 490):

This test has been developed primarily to measure understanding and the ability
to apply knowledge in the interpretation of situations and the solution of problems.
Testing of ability to recall minute isolated facts has been minimized. Rather,
the student is given an opportunity to demonstrate how well he can discern relationships
between what he has learned and the world of living things which he
encounters every day.





Tests such as this, which have extended their questions to include the
application of principles and interpretations of everyday situations, are con- 
sistent with the trend of best practice in science instruction. 

Standard tests have much to offer the science teacher, particularly at the 
secondary level, if certain requirements are met. In the first place, they ordi- 
narily are prepared by experts in the field and much thought and study go into 
the construction of each item. Secondly, a standardized test has been tried 
out and revised to meet certain criteria of difficulty and internal reliability. 
Finally, norms usually are available so that a teacher may compare the stand- 
ing of his class with other comparable groups of pupils. 

Precautions in Use of Standardized Tests. Before these advantages can 
be realized, however, certain requirements must be met. 

1. The objectives and topical coverage of the standardized test should be
approximately the same as those of the teacher’s class.

2. The students on whom the test was standardized should be comparable
to the students in the teacher’s class. The norms of a test that was standardized
on pupils of a rich urban school district are not likely to be applicable to pupils
in an impoverished rural district.

3. The test should be constructed and validated according to prevailing
best practice. This may be checked by an inspection of the test’s manual and
by reading critical reviews relative to it in journals or in the Mental Measurements
Yearbook (see Chapter 16, pages 426-427, for a discussion of the
selection of standardized tests).

Special Problems in Science Measurement 

As a final consideration, it is necessary to mention several problems in
measurement and evaluation particular to science. 

1. Although excellent work has been done in identifying the various 
aspects of science instruction that should be measured, selection of dimen- 
sions for any given class and the development of a standard in terms of levels 
of performance are left to the teacher in most cases. 

2. The science teacher in particular is confronted with the task of evaluat- 
ing the ability of pupils to carry out a scientific approach to problems and 
their scientific attitudes. From previous discussion it is apparent that this is 
a difficult task. In order to do an effective job, the teacher may need to plan
some controlled situations that will allow for observation of performances 
keyed to these dimensions. 

3. Ordinarily, the elementary school science program includes a multitude
of experiences for the children: science tables, field trips, free reading as well
as assigned, and a great deal of incidental relation and discussion. Consequently,
the elementary teacher cannot evaluate the pupils’ learning of a well-organized
body of subject matter. He must isolate the “scientific” aspects of
achievement in all this variety of activity and devise means of evaluating them.
For this purpose, observation and the maintenance of cumulative records
probably are the best recourse.





MATHEMATICS 

We come now to a consideration of measurement and evaluation in mathematics.
In general, we shall follow the same outline that was used for science.
First, the nature of mathematical activity, both in general and in the schools, 
will be discussed. The broad dimensions of the subject will be pointed out 
next and some specific breakdowns of these dimensions will be exhibited. 
Following this, we shall develop sample evaluative standards for several of 
the dimensions as well as some appropriate measuring procedures. A brief 
discussion of standardized tests in the field and of some of the special prob- 
lems involved in the measurement of mathematics instruction will bring this 
section to a close. In many instances, what has been said earlier in this chapter 
in connection with science applies here also. 

Nature of Mathematical Activity 

Mathematics is in nature a highly symbolic subject. Its origins are found
in the development of symbols by our primitive ancestors to represent quantities
and collections of objects. To deal with quantities, various abstract number
systems were developed, culminating in the present-day Hindu-Arabic
system, which is especially abstract in that it involves the idea of positional
notation. In order to cope with quantitative problems, such operations as
addition, multiplication, subtraction, and division were devised. These operations
are even more abstract than the symbols and are very closely regulated
by a set of rules. Paralleling the development of a symbolic scheme for handling
quantitative relationships, arithmetic and algebra, was the development
of a symbolic scheme for handling space relationships, geometry. It was in
geometry that the Greeks discovered the deductive nature of mathematics.
However, the full implication of their discovery was not realized until some
two thousand years later. So until present times, mathematics has been considered
primarily a useful symbolic tool for dealing with quantitative and
spatial problems, and it has been characterized essentially by manipulations
of these symbols.

Today there seems to be a strong movement toward a proper emphasis on 
the long overlooked deductive nature of mathematics and toward a better use 
of its potentialities. When the content and method of mathematics are carefully
examined, it is apparent that mathematics is an idealized language,
permitting the greatest possible precision in thought and operation. Intuitive
ideas are given symbolic form, and the relations and operations among these
ideas are identified and are likewise given symbolic form. Starting with a few
assumptions and using certain basic rules of logic, a symbolic system is developed
that has myriad ramifications and applications. This system may
happen to be arithmetic, algebra, geometry, analytic geometry, calculus, or 
topology. The method is the same for all. 

This increasing emphasis on the deductive nature of mathematics does not 





CUSTOMARY USES OF MEASUREMENT AND EVALUATION 


imply, however, that the instructional methods in teaching mathematics should 
be formal and deductive. On the contrary, the deductive structure of mathe- 
matics can be taught most successfully by using informal inductive methods. 
This has been shown to be true particularly in the “meaning approach” used 
in arithmetic. Here the students are provided experiences through the manipulation
of concrete and semiconcrete materials, and are encouraged to discover
for themselves the basic principles, rules, and generalizations that form the
deductive scheme of arithmetic.

To conceive of mathematics as an idealized and precise language allows 
a broader application of its principles and operations. Heretofore, the primary 
purpose of mathematics instruction has been to teach the symbolic manipulations
necessary for the solution of man’s physical and social problems. This
narrow utilitarian view is now being replaced by a much broader viewpoint
which still emphasizes the application of mathematics. Mathematics can be
used to facilitate thinking in general. Its symbolic systems are models of systematic,
precise, and logical thought. Other subjects and areas of activities
use mathematics as a guide for organizing their findings into useful and effective
systems.

Mathematics has the same potential use for the individual as well. A 
person studying mathematics not only can acquire a set of highly useful 
manipulative techniques but also he can acquire a highly useful model for
precise expression of symbolic construction, for identifying relations between
ideas, making valid logical inferences, and in general for organizing and using
his ideas and experiences. Mathematics can provide a vital experience in how 
ideas and concepts are interconnected, interrelated, and interdependent, and 
how the usefulness of concepts depends upon the structure of which they are 
a part. Through mathematics it can be shown clearly that ideas do not stand 
alone and that a hodgepodge of ideas is nearly useless. This broader utilitarian 
view of mathematics instruction is gaining wider acceptance in our schools
today. 

Measurable Dimensions in the Study of Mathematics 

Once again, as in science, it is possible to refer to the work of a number
of individuals and committees as a source for the measurable dimensions of
mathematics instruction (see bibliography at the end of the chapter). For
convenience, we have arbitrarily established four broad categories of dimensions:

1. Computational and manipulative techniques. 

2. Concepts and principles involved in quantitative and spatial relation- 
ships. 

3. Logical structure. 

4. Application to mathematical problems. 

Within each of these categories are found numerous specific items of pupil 
achievement relative to mathematics. 





COMPUTATIONAL AND MANIPULATIVE TECHNIQUES

In general, this dimension might be described as: “The ability to perform
certain computational and manipulative techniques accurately and with facility.”
The distinct computational techniques covered in arithmetic
are too numerous to list, as are the varied manipulations of algebra, geometry,
trigonometry, etc. Consequently we shall cite only a few examples.

Counting: number building, decomposition of number. 

Adding: basic combinations, grouping, carrying. 

Subtracting: basic combinations, borrowing, take-away, equal-additions, and 
complementary methods. 

Multiplication: basic combinations, repeated addition, placing of partial prod- 
ucts, factors. 

Division: repeated subtraction, trial divisors, long and short methods.

Operations with fractions, decimals, and percentages.

Handling of zero.

Removal of parentheses, transposition, cancelation, etc.

Solution of simple formulas in one unknown. 

CONCEPTS AND PRINCIPLES INVOLVED IN QUANTITATIVE AND
SPATIAL RELATIONSHIPS

The dimensions in this category are concerned with an understanding
of the concepts and principles discussed and developed in the study of the
various symbolic systems in mathematics. Again the concepts and principles
involved in mathematics are far too numerous to present here. We shall confine
ourselves to listing some of the dimensions of understanding that pertain
to all concepts and principles. These are the abilities:

1. To associate the concept with an example.

2. To provide an illustration of the principle or concept. 

3. To develop a definition of the concept. 

4. To recognize a misuse of the concept or principle.

5. To identify the principle when used. 

6. To discriminate between the concept and other closely allied concepts.

7. To develop origins and justifications for the concepts.

8. To use the principle to explain what has happened in a situation.

LOGICAL STRUCTURE 

As we might surmise from our previous discussion of the nature of mathematics,
dimensions in this category have only recently been developed and are
only now beginning to receive serious attention in the schools. In dealing with 
this aspect of mathematics, knowledge of the meaning and significance of 
certain concepts is essential. Among the concepts are: statements, relations, 
operations, assumptions or postulates, definitions, implications, laws, proof,
and theorems. At least some intuitive notion of their meaning is necessary for 
any consideration of the structure of ideas and consequently is a basic dimen- 
sion for this aspect of mathematics. 

For additional dimensions of achievement concerned with logical struc- 
ture, we draw upon the work of the Joint Commission of the Mathematical 
Association of America and the National Council of Teachers of Mathematics 
(9). In their view understanding of logical structure involves the following 
abilities: 

1. To formulate clear and concise statements. 

2. To identify the principle or assumption which organizes a given set 
of statements. 

3. To distinguish between words that need defining and words that are
not to be defined. 

4. To identify and formulate the implicit assumptions in an argument. 

5. To follow a deductive argument. 

6. To organize statements into a coherent logical structure.

7. To formulate a logical argument. 

8. To identify and formulate implied relationships between statements. 

It is well to note that these dimensions can be applied to nonmathematical
as well as to mathematical material. 

APPLICATION TO MATHEMATICAL PROBLEMS

In this category come various aspects of problem solving and mathematical 
applications. Relating mathematics to practical situations is an important 
phase of the study of mathematics, and for this reason we shall consider its
particular dimensions.

In applications of mathematics, measurement plays a vital role. In fact,
measurement provides the connecting link between the symbolic systems of
mathematics and the physical and social applications. In the problems and
applications discussed here, we shall assume that the necessary measurements
have already been made and are provided or are readily available. Since
measurement plays such an important role in mathematical applications, some
understanding of the particular units of measurement that are involved will be
necessary. Such units are pounds, quarts, miles per hour, horsepower,
Fahrenheit degrees, cubic feet, etc. The meaning of these units must be under- 
stood before certain problems can be solved. 

Some of the dimensions identified for the application of mathematics are 
the following abilities: 

1. To select the necessary facts for the solution of problems. 

a. By telling what additional information is needed. 

b. By telling what information may be discarded. 

2. To formulate a problem from a given set of data. 





3. To estimate the answer to a problem. 

4. To symbolize the components of the problem. 

5. To recognize the mathematical processes required for the solution and 
to set up the step-by-step operations. 

6. To perceive and express symbolically the unknown and the condi- 
tions for finding the unknown. 

7. To translate physical quantities and relations to mathematical sym- 
bols and relations. 

8. To develop a plan for solution (a search model). 

a. By considering analogous problems. 

b. By providing the necessary definitions involved. 

c. By varying the conditions of the problem. 

d. By identifying patterns and sketching figures. 

9. To generalize the problem and generalize the solution. 

Evaluative Standards in Mathematics 

Texts on mathematics education and instructors’ manuals have been far
more concerned with what students should learn and how they may best be
taught than with how their progress may be validly evaluated. The bases for
grades on the report card, for weekly marks, and for the teacher’s satisfaction
in or concern over any pupil’s progress are much the same as for other subjects.
Arbitrary percentages of problems answered correctly probably still
constitute the most common basis for marks; 95 per cent and better equals A
and so on to where 65 or 70 per cent is the barely passing mark. In many
classes the use of percentages of correct answers has been abandoned in favor
of an established relationship between rank in class and a given value symbol.
With such a system, the median or mean achievement of the class usually is
the fulcrum for evaluation, higher ranks receiving the more favorable marks,
and lower, the less favorable. In other situations, quality of performance is
reckoned in terms of mastery of a given set of operations, propositions, or
theorems. Finally, the norms of standardized tests or of departmental tests
sometimes are employed as the basis for judging the value of pupils’ work.

As we have observed in discussions of standards for other subjects, the
use of evaluative standards such as these leaves several questions unanswered
about the pupils’ competence. We still do not know what any pupil actually
can do who gets an A or a C. Nor do we know precisely what the F student
cannot do that earns him the F. The best we can say is that the A student is
probably better at mathematics than the C student, and the C student better
than the F student. And if we are thoroughly familiar with the content of
mathematics in the grade or course in question, we can surmise in general
terms that the student should know and be able to do such and such. But the
evaluative symbol based on percentage, rank, mastery, or test norm does not
in itself indicate this.

For evaluations of progress in mathematics that are informative about the
quality of the pupils’ achievement, it is necessary to use some type of performance
scale representing gradations or levels of performance from that
which is considered to be of least value to that which is of greatest value. A
generalized form of such a standard was presented for understanding of any
subject in Chapter 9. Earlier in this chapter one was described for a dimension
of science, and in Chapter 11 a variation scheme was developed for evaluating
understanding of U.S. History. Now we will describe one for use with
each of the categories of mathematics dimensions just presented. Examples of
performance at various levels will be in terms of relevant mathematical problems
and/or their solution. Since, as we shall see, the primary procedure of
measurement in mathematics involves the administration and scoring of
selected problems, the following illustrations of standards afford illustrations
of appropriate measuring procedures as well.

COMPUTATIONAL TECHNIQUES

In the category of computational techniques, performance levels are developed
mainly on the basis of the seriousness of the errors made in computation.
Accuracy and facility are also factors to be considered. The following
example is based on a very specific computational technique, that of subtracting
numbers in which there are zeros in the minuend (the top number). This
is commonly taught in the third or fourth grade.

TABLE 17

Example of an Evaluative Standard for Arithmetic Computation


Dimension: The ability to perform subtraction in the case where there are zeros
in the minuend (top number).

Level   Performance and Examples

I       No apparent idea of what to do about the zero in the minuend. May make
        some attempts that are completely wrong.

        Example (subtracts 0 from 1):

             406
            -215
            ----
             211

II      Can subtract correctly when no borrowing is necessary or when at most
        only one-step borrowing is necessary. Cannot subtract correctly when
        two-step borrowing is necessary.

        Can subtract correctly:

             406     140     409
            -201    -215    -152

        Cannot subtract (fails to borrow second step correctly):

             405     600
            -258    -168
            ----    ----
             157     242

III     Can handle with facility and accuracy the zeros in various positions,
        both in the minuend and in the subtrahend. Can also handle two- and
        three-step borrowing situations correctly.

        Can subtract correctly:

             302     600     700
            -275    -389    -508





The performance levels have been described in terms of the seriousness 
of the errors involved. This basic error variation is considered to be keyed to 
varying degrees of understanding of the computational technique. Accuracy 
and facility are not specifically included. However, accuracy may be accounted 
for by variation in careless errors, and facility may be gauged by the variation 
in time required to complete the exercises. Only three performance levels have 
been identified here. After some experience with this scheme, the teacher un- 
doubtedly will wish to expand and refine these levels and perhaps add addi- 
tional ones. 
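
One way a teacher might make this basic error variation explicit is to count how many borrowing steps each exercise demands and then compare a pupil's answers against that count. The Python sketch below is an illustration only, under stated assumptions: `count_regroupings` and `classify_level` are hypothetical helper names, a "step" is taken to mean one regrouping from a place to the place on its right, and the three-way rule is a simplification of the levels described above, not a validated scale.

```python
def count_regroupings(minuend: int, subtrahend: int) -> int:
    """Count borrowing steps needed by the decomposition (borrowing) method."""
    if not 0 <= subtrahend <= minuend:
        raise ValueError("expected 0 <= subtrahend <= minuend")
    top = [int(d) for d in str(minuend)][::-1]        # digits, ones place first
    bottom = [int(d) for d in str(subtrahend)][::-1]
    bottom += [0] * (len(top) - len(bottom))          # pad the shorter number
    steps = 0
    for place in range(len(top)):
        if top[place] < bottom[place]:
            donor = place + 1
            while top[donor] == 0:                    # skip zeros in the minuend
                donor += 1
            while donor > place:                      # regroup one place at a time
                top[donor] -= 1
                top[donor - 1] += 10
                steps += 1
                donor -= 1
    return steps


def classify_level(responses: dict[tuple[int, int], int]) -> str:
    """Map a pupil's answers to Level I, II, or III (assumed, simplified rule)."""
    def all_correct(wanted) -> bool:
        return all(answer == m - s
                   for (m, s), answer in responses.items()
                   if wanted(count_regroupings(m, s)))

    easy_ok = all_correct(lambda n: n <= 1)   # no borrowing or one-step borrowing
    hard_ok = all_correct(lambda n: n >= 2)   # two-step borrowing and beyond
    if easy_ok and hard_ok:
        return "III"
    return "II" if easy_ok else "I"


# Hypothetical pupil: right on the easy exercises, wrong on 405 - 258.
pupil = {(406, 201): 205, (409, 152): 257, (405, 258): 157}
print(count_regroupings(405, 258), classify_level(pupil))
```

The rule can only separate the levels if the exercises themselves include every kind of borrowing; a set of easy items alone would place every accurate pupil at the top level.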

In many cases a teacher, in measuring ability to subtract such numbers, 
might simply construct a number of exercises involving zeros in the minuend 
and then evaluate his pupils on the basis of the total number of exercises that 
were answered correctly. Such a procedure is open to question. For one thing,
it could be that the test provided for no basic error variation and hence did
not measure variation in understanding of the subtraction technique. Starting 
with an evaluative standard composed of performance levels, however, should 
insure that a test will be devised to measure variation in performance. 

CONCEPTS AND PRINCIPLES

Since the dimensions in this category are connected primarily with the
understanding of the concepts and principles involved in quantitative and
spatial relationships, we may apply the generalized standard for understanding
presented in Chapter 9. To illustrate performances typical of the several different
levels of understanding, we have selected the concept of the area of a parallelogram,
a subject commonly taught in the seventh and eighth grades.

TABLE 18

Example of an Evaluative Standard for Understanding a Mathematical Concept


Dimension: An understanding of the area of a parallelogram.

Level Performance

I Can repeat the formal definition for finding the area of a parallelogram provided
in class. Can also duplicate the figure drawn previously to illustrate finding
the area.

Example:

“The area of a parallelogram
is equal to the base times the
height.”

A = b × h





TABLE 18 (Continued)


II Can identify the process of finding the area of a parallelogram in different
circumstances.


Example: Show how you would find the area for each of the following parallelograms.



III Can compare and relate the process of finding the area of a parallelogram to
finding the area of other plane figures.

Example:

1. Draw a rectangle that has the same area as this parallelogram.

2. Draw a triangle that has the same area as this parallelogram.

3. Draw a triangle that has half the area of this parallelogram.

IV Can predict and explain on the basis of his understanding of the area of a
parallelogram.

Example:

Points A, B, C, D are all hinged
so that the parallelogram can be
changed in shape, but points A and B
are fixed in position.

1. Suppose that points C and D are moved to the right, with points A and B
fixed in position. What happens to the area of the parallelogram? Why?

2. Suppose that points C and D are moved to the left, with A and B fixed in
position. What happens to the area of the parallelogram? Why?

3. Draw another position for C and D where the area of the parallelogram is
the same as above. Give reason.

4. Suppose that the height of a given parallelogram is doubled. What happens
to the area of the parallelogram?


Creative Can reorganize the concept of the area of a parallelogram on a
different basis.


Example:

This concept of area is discussed by Fehr in Secondary Mathematics (?), pp. 6-17.





TABLE 18 (Continued)



Start with line AB in position
shown. Then move it to position
A′B′, keeping the line horizontal and
fixed in length. What is the area of
the space that was covered by line
AB?


See if you can continue with this
idea and develop the areas of other
figures. For instance, what is the area
of this figure?


The levels of performance described in Table 18 are assumed to be 
cumulative. In other words, Level III presupposes performance at Levels I
and II. This example should indicate some of the many possibilities of the
variation scheme for evaluating understanding presented in Chapter 9. 
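
The Level IV questions about the hinged parallelogram can be checked numerically. The sketch below assumes hypothetical dimensions (a base of 4 units and a hinged side of 3 units) and a helper name, `hinged_area`, that is not from the text. Because the sides are rigid, leaning the figure lowers its height, so the area falls from its largest value when the figure is a rectangle.

```python
import math


def hinged_area(base: float, side: float, angle_deg: float) -> float:
    """Area of a parallelogram from its base, its hinged side, and the angle at A."""
    height = side * math.sin(math.radians(angle_deg))  # height shrinks as the figure leans
    return base * height


# Assumed dimensions for illustration only.
BASE, SIDE = 4.0, 3.0

for angle in (90, 60, 30, 150):
    print(angle, round(hinged_area(BASE, SIDE, angle), 3))
```

Under these assumptions the angles 30 and 150 degrees give the same area, which is one possible answer to the request for another position of C and D with equal area: the figure leaned the same amount in the opposite direction.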

LOGICAL STRUCTURE

To illustrate the use of performance levels as evaluative standards for the 
dimensions of logical structure we have selected the dimension: ability to 
detect flaws in reasoning and to identify a valid argument. This dimension is 
significant at both the elementary and the secondary level. Moreover, the 
content for the examples in Table 19, the topic “rate of travel or speed,” is as
frequently encountered in the elementary school as in the high school.

Again it is assumed that the performance levels in the standard are cumulative.
It should be noted that the gradations of performance represent the detection
of progressively more subtle flaws in reasoning. The levels and examples
are meant to be illustrative rather than definitive. To develop a standard for
use with a class, a teacher should list many different types of fallacious reasoning,
try them out in several tests, observe how many students and what students
detected each one, and then devise a performance scale to include the
most sharply discriminating flaws.

APPLICATION OF MATHEMATICS

This so-called practical dimension of mathematical ability is a particularly
difficult one to evaluate. It is inextricably interwoven with the problem-solving
process in general; it may involve terms and concepts from many fields of
knowledge; and it is often dependent upon ability to read well. If a pupil
solves an application problem, is it a function of his ability to apply mathematics,
or is it perhaps a function of his intelligence, his general knowledge,
and his skill at reading as well as his ability to apply mathematics?





TABLE 19

Example of an Evaluative Standard for Logic


Dimension: The ability to detect flaws in reasoning and to identify a valid argument.

Level Performance

I Fails to identify any flaws in an argument, even the most obvious flaws. Does
not investigate the structure of an argument but judges the validity of an argument
only on the basis of whether the conclusion is true or false in his estimation.

Example: The following argument is presented: “It takes longer to go to the
post office than to go to the bank from school. Therefore the post office is
further from the school than the bank is.”

Pupil believes this to be a valid argument because he assumes that the distance
from the school to the post office and from the school to the bank is the only
factor affecting speed of travel.

II Can pick out the more obvious flaws, such as identifying the missing step in an
argument and recognizing the irrelevant use or misuse of a definition or principle.

Example: Pupil recognizes that the argument presented in Level I is invalid
by identifying the statement missing, namely, that the rate of travel must be
constant and equal in both cases.

III Can pick out the more subtle flaws of an argument, such as assuming that the
converse or opposite of an implication is true, using a non sequitur, and begging
the question.

Example: The following argument is presented: “If the car was traveling 60
miles per hour, then the driver was exceeding the speed limit. But the car was
not traveling 60 miles per hour; therefore we can conclude that the driver was
not exceeding the speed limit.”

Pupil recognizes that this is an invalid argument because of the fallacy of
assuming that the opposite of an implication is true.
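
The Level III fallacy can be verified mechanically with a truth table. The sketch below is an illustration, not part of the standard itself: `implies` and `is_valid` are hypothetical helper names, P stands for "the car was traveling 60 miles per hour," and Q stands for "the driver was exceeding the speed limit." An argument form is treated as valid only if no assignment of truth values makes every premise true and the conclusion false.

```python
from itertools import product


def implies(p: bool, q: bool) -> bool:
    """Material implication: 'if p then q'."""
    return (not p) or q


def is_valid(premises, conclusion) -> bool:
    """True when no truth assignment satisfies all premises yet falsifies the conclusion."""
    for p, q in product([True, False], repeat=2):
        if all(premise(p, q) for premise in premises) and not conclusion(p, q):
            return False
    return True


# Form used in the Level III example: if P then Q; not P; therefore not Q.
opposite_assumed = is_valid(
    [implies, lambda p, q: not p],
    lambda p, q: not q,
)

# A sound relative for comparison: if P then Q; not Q; therefore not P.
contrapositive_used = is_valid(
    [implies, lambda p, q: not q],
    lambda p, q: not p,
)

print(opposite_assumed, contrapositive_used)
```

The counterexample the check finds is the case in which the car was traveling at some other speed that also exceeded the limit: both premises hold, yet the conclusion fails.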


In keeping with this general difficulty in evaluation, the development of
a performance scale for use as an evaluative standard is especially difficult. No
generalized one seems possible, since each application of mathematics is a
function not only of mathematics but of the application situation as well. Particularized
ones could be developed for each specific application, but their
number then would be legion. For these reasons, we shall not provide an illustration
of performance levels for this dimension. Instead, we shall indicate
certain types of variable that may serve to determine the efficiency with which 
a student applies his mathematical knowledge to other areas. 

Among the possible variables are:

1. Generalizations, principles, laws known.

2. Extensiveness of manipulative skills.

3. How abstract are conceptions of mathematical operations?

4. How many elements can be manipulated in reaching solutions? 

5. Skill at symbolization. 

6. Discrimination between relevant and irrelevant operations in any prob- 
lem. 

7. Facility to reverse a process learned in one way.

Forms and Procedures of Measurement in Mathematics 

As with science, classification and rank symbols are the most widely used
forms for expressing measures of achievement in mathematics. They are appropriate
both to the nature of the dimensions and to the evaluative standards.
Scale numbers have somewhat more of a place in mathematical measurement
than in science. So far as the manipulative dimensions are concerned, instruments
may be devised to yield scores that may be treated like scale intervals.
More perhaps than in any other school subject is it possible to construct test
items that have equivalent difficulty. For example, if the items are all addition
problems involving three two-place figures, it may be assumed that any
problem is as difficult as any other. Hence, the difference between a score of
70 and a score of 80 may be assumed to be equivalent to the difference between
a score of 30 and a score of 40, and the scores may be added, means
determined, etc., with mathematical validity.

All procedures of behavioral measurement are applicable to mathematical
dimensions. However, because of the exact nature of the material, guided
response tests are particularly appropriate. As you may recall, the gist of a
guided response item that measures achievement (true-false, multiple-choice,
etc.) is that responses mean one thing and one thing only. Nowhere is this
condition more likely to obtain than in mathematics, where by definition there
usually is only one right answer. Rivaling guided response tests as an effective
means of measurement in arithmetic, algebra, etc., is the analysis of pupil
assignments done for instructional purposes.

A number of illustrative test items are presented in Tables 20, 21, and 22
for each main branch of mathematics taught in the common schools. They 
include items appropriate to each of the four basic dimensions of mathe- 
matics: manipulative techniques, concepts and principles, logical structure, 
and applications. 

Arithmetic. Table 20 presents a few examples of guided response items
in arithmetic selected to show some variety in content and form.





TABLE 20

Examples of Test Items for Arithmetic


1. In which of the following numbers is the tens digit the smallest? the greatest?

(a) 4^02 (b) 63 (c) 8,563,432 (d) 352

2. If a digit is moved 2 places to the left, how many times has its value been increased?

(a) 200 (b) 10 (c) 5 (d) 100

3. The Cub Scouts sold candy at a carnival. In order to find their profit, list the things
you would need to know.

4. Shade three eighths of this circle.

5. What part of this rectangle is shaded?

6. In which of the following divisions would the first trial divisor be 5?

(0 16 /roTT (b) 5) /ToT? (c) 19 17)35 (d) 25 /Tim 

7. Susan made punch by mixing one cup of orange juice to three cups of ginger ale.
What percentage of the punch is orange juice?

Suppose she added another cup of ginger ale. What percentage of the punch is
orange juice?

8. If six is added to both the numerator and denominator of the fraction 3/5, what
happens to the size of the fraction?

9. ________ Estimate a line that is 5/16 of the given line.

10. The population of Ciitleville is now 35,000. It has increased 20 per cent since 5
years ago. What was its population 5 years ago? If its growth is steady, what will
its population be 10 years from now?
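
A teacher preparing a key for items such as 7, 8, and the first part of 10 may find it helpful to work them with exact fractions. The sketch below is one possible check, with variable names chosen only for illustration. The second part of item 10 is left out because "steady growth" could mean either a fixed number of people or a fixed percentage per period, and the item does not say which.

```python
from fractions import Fraction

# Item 7: one cup of orange juice mixed with three cups of ginger ale.
juice, ginger_ale = Fraction(1), Fraction(3)
share_before = juice / (juice + ginger_ale)        # fraction of punch that is juice
share_after = juice / (juice + ginger_ale + 1)     # after one more cup of ginger ale

# Item 8: add six to both the numerator and the denominator of 3/5.
original = Fraction(3, 5)
changed = Fraction(3 + 6, 5 + 6)

# Item 10 (first part): 35,000 is 20 per cent more than the earlier population.
earlier_population = Fraction(35_000) / (1 + Fraction(20, 100))

print(f"item 7: {float(share_before):.0%} then {float(share_after):.0%}")
print(f"item 8: {original} becomes {changed}, larger: {changed > original}")
print(f"item 10: about {round(earlier_population)} five years ago")
```

Item 10 is a useful diagnostic because a pupil who subtracts 20 per cent of 35,000 arrives at 28,000, a different figure from the one obtained by dividing by 1.20.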

The items shown in Table 20 are the type usually encountered in arithmetic.
Further variety can be gained by:

1. Asking questions on portions of a computational process.

2. Asking in words for the computation to be performed rather than having
the procedure already set up in symbols.

3. Asking questions about the results of a computation.

4. Changing values of parts of a completed computation and asking what
happens to the results.





Algebra. The items shown in Table 21 for algebra are taken from elementary
algebra and are concerned primarily with basic definitions.

TABLE 21 

Examples of Test Items for Algebra 


I. (1) ~2k (2) Ik (3) 1 (4) k (^) 0 

2- — ■^= (1) * + y (2)4x-h4y (3)--^ + ^ + 

3. (1) 5 (2) 0 (3) 2i <4) 1 (5) ^ 

4. ^ (1) 7 ( 2 )^^ (3) t (4)277 

5 '2L-E.— (1) "J--P (2) ~ ^ "C. ( 4 ) — ‘ ~'"P 

" n q n — q nq —nq n — q 

(5) None of these. 

‘■|-,7= I" V- <"' <’> 1 '“'--a" 

(5) None of these. 

7. — r2=r (I) ^ (2) 1 (3) r'> (4) 0 L5) None of these. 

8. If m is 6% of n and m = 12, then n is equal to

(1) 2 (2) 200 (3) 500 (4) 5 (5) None of these. 

9. What is the value of \ + </- 

(1) (2) J (3) J\2 (4) 2d (5) V27/ 

10. A number r is increased by 35%. What is the result?

(I) ().3.5r (2) ;f 35 (3) 1.35: (4) z -i 0.75 (51 ^ + 7 ^ 

II. If HI = tlien <7 is equal to (l)~ (2) ^ 7 , (3) l4) ”^7 

(5) None of these. 




(a) Which of the above curves indicates that as y increases x decreases?

(b) Which indicates that x varies directly as y?

(c) Which indicates that as y increases, x remains constant?

(d) Which indicates that x and y are everywhere equal?





The questions listed are particularly useful for diagnosing basic errors since 
most of the common errors are listed as alternate choices. Other items can 
be developed by using algebraic symbolism to express functional quantitative 
and spatial relationships. 

For example: 

1. A is z miles behind B. A's speed is x miles per hour while B's speed is
y miles per hour. If A's speed is greater than B's, how long will it take for A
to catch up with B?



2. What is the area of this figure?
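
The first of the two examples above reduces to dividing the head start by the difference between the two speeds. The sketch below expresses that relationship; `catch_up_time` and its parameter names are hypothetical, and the numbers in the usage lines are made up for illustration.

```python
from fractions import Fraction


def catch_up_time(gap_miles, pursuer_mph, leader_mph) -> Fraction:
    """Hours needed for the pursuer to close the gap, assuming constant speeds."""
    if pursuer_mph <= leader_mph:
        raise ValueError("the pursuer must be faster than the leader")
    closing_speed = Fraction(pursuer_mph) - Fraction(leader_mph)  # miles gained per hour
    return Fraction(gap_miles) / closing_speed


# Hypothetical values: a 15-mile gap, speeds of 60 and 40 miles per hour.
print(catch_up_time(15, 60, 40))   # fraction of an hour
print(catch_up_time(10, 50, 40))
```

An item of this kind tests whether the pupil sees that only the difference in speeds matters, which is why the condition that A be faster than B is stated in the problem.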


Geometry. Our illustrations for geometry all involve movable geometric 
figures that call upon the student to apply his understanding of the properties 
of these figures, not just to repeat an operation he has learned in class. These 
questions could be used with students who have been studying intuitive geometry
as well as those who have been studying demonstrative geometry.


TABLE 22 

Examples of Test Items for Geometry



This parallelogram is hinged at each
vertex and the sides are made of
rigid material. As the parallelogram is
moved and changes shape, what happens
to the diagonals? Of what kind
of material should the diagonals be
made?



The vertex C of triangle ABC moves
in the direction indicated, parallel
to AB, while points A and B remain
fixed in position. What happens
to the area of the triangle? What
happens to angle





TABLE 22 (Continued)



Point E moves along AB while point
D moves along AC so that line DE
remains parallel to line BC.

(a) What happens to angle x as DE
moves toward BC?

(b) What happens to angle x as DE
moves away from BC?



If the three sides of the right triangle
shown are each doubled in length,
what happens to the size of the angles
in the triangle? What happens to the
area?



If the radius of the circle shown is decreased M) per
cent, what happens to circumference? to the area?



In this figure points A and B remain fixed in position
while point C moves around the circle in a clockwise
direction toward B.

(a) What happens to angle

(b) What happens to angle x?



If angle x and angle a are decreased one half, then
angle z is

(1) reduced one half?

(2) increased one half?

(3) doubled?

(4) the same?

(5) Answer not given?





TABLE 22 (Continued)



This figure represents a right circular cylinder having
a base of radius r and altitude h.

(a) What happens to the volume of the cylinder if
the circumference of the base is doubled while the
altitude remains constant?

(b) What happens to the volume if the radius is increased
50 per cent and the altitude is increased 50
per cent?



(a) What is the area of the shaded portion?

(b) Suppose l is doubled in length. What happens
to the shaded area?


If the perimeter of this square is doubled, what happens
to the area?

Many other relational questions can be devised using moving geometric
figures. This type of question is particularly effective in testing the students’
understanding of geometric concepts in dynamic situations rather than in the
static situations usually presented in high school textbooks.
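
Relational questions of this kind rest on simple scaling rules, and a teacher preparing an answer key may wish to check them numerically. The following is a minimal sketch in Python and is not part of the original test items; the function names and the starting dimensions are illustrative assumptions. It checks three cases: doubling the circumference of a cylinder's base while the altitude stays constant, increasing both radius and altitude by 50 per cent, and doubling the perimeter of a square.

```python
import math


def cylinder_volume(radius, altitude):
    """Volume of a right circular cylinder, V = pi * r**2 * h."""
    return math.pi * radius ** 2 * altitude


def square_area_from_perimeter(perimeter):
    """Area of a square whose four equal sides sum to `perimeter`."""
    side = perimeter / 4
    return side ** 2


# Illustrative starting dimensions (assumed; any positive values would do).
r, h = 2.0, 5.0
base_volume = cylinder_volume(r, h)

# Circumference C = 2*pi*r, so doubling C doubles r while h stays constant.
ratio_circumference_doubled = cylinder_volume(2 * r, h) / base_volume

# Radius and altitude are each increased 50 per cent.
ratio_both_increased = cylinder_volume(1.5 * r, 1.5 * h) / base_volume

# The perimeter of a square is doubled, which doubles each side.
p = 12.0
ratio_square_area = square_area_from_perimeter(2 * p) / square_area_from_perimeter(p)

print(ratio_circumference_doubled, ratio_both_increased, ratio_square_area)
```

The ratios do not depend on the starting dimensions chosen, which is the understanding such items are meant to test.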

STANDARDIZED TESTS IN MATHEMATICS

The standardized tests available in mathematics are listed in publishers’
catalogues and the standard reference work for published tests, The Mental
Measurements Yearbook, edited by Buros (1). The Yearbook contains critical
reviews of many of the tests, and reviews of recently published ones may be
found in mathematics and educational journals. Selected tests are cited in
Appendix B of this text.

Many mathematics tests cut across subject matter lines. Several cover
all phases of high school mathematics and are designed to serve as college
entrance examinations. Others are devoted to measuring achievement in general
mathematics, including algebra and geometry. Some tests attempt to measure
functional competence or the ability to do quantitative thinking. In general,
published tests in mathematics usually are employed on a school-wide basis
for purposes of ability grouping, guidance, etc., and by colleges as part of
their entrance examination batteries.

Tests of arithmetic measure computational ability primarily, but some 
include items devoted to arithmetical reasoning. Reviews of more recently 
published tests indicate that test designers are beginning to emphasize the 
meanings involved in the manipulative processes and are developing novel 
approaches to measuring pupils’ understanding of them. Some diagnostic tests 
are available that attempt to help the teacher identify the basic difficulties
pupils have in arithmetic. 

Standardized tests in algebra include readiness or prognosis tests as well 
as achievement tests. The prognostic instruments are based primarily on the 
assumption that a good fundamental understanding of arithmetic is the best
predictor of success in algebra. The achievement tests generally give
satisfactory coverage to factual content and basic skills. The tests published for
geometry tend to be factually oriented and to give too little attention to
applications, to inductive development, and to the structure of reasoning and
proof.

The following is a summary of the number of published standardized tests
in different mathematics areas listed in The Fourth Mental Measurements
Yearbook:

Subject                      Number of Tests

Mathematics in general             15
Algebra                            18
Arithmetic                         24
Geometry                           16
Trigonometry                        3

Special Problems of Measurement in Mathematics 

Reviewing our discussion of measurement and evaluation in mathematics,
the teacher of mathematics seems to be faced with at least three special
problems.

1. There is need to avoid the strictly manipulative type of question, where
the numbers are all set up for the process to be carried out by the student. It
is tempting to compose tests of this type of item since the questions appear
ready-made in the textbooks. However, such items usually will measure only
rote memory aspects of arithmetic achievement.

2. It is especially difficult to use the closely knit symbolic systems of
mathematics in developing questions that show the principles of valid reasoning
and the structure of relationship of ideas.

3. It will be necessary for the teacher to devise his own variation schemes
or performance scales for evaluating dimensions of mathematics, particularly
those involving an application of mathematics to everyday practical situations
and an understanding of the concepts commonly taught in mathematics. Pupil
texts and instructors’ manuals provide little assistance in this wise, and
classroom practice offers few examples of the use of such evaluative standards.

Summary 

Science has two major aspects, its body of systematized knowledge and its 
basic process or approach known as the “scientific method.” Many profes- 
sional groups have defined the dimensions of pupil achievement in science. 
One particularly useful list includes these items: functional information,
concepts, and understandings concerning the physical universe; instrumental skills
such as to read science content, perform arithmetical operations, manipulate 
science equipment and make accurate measurements; problem solving skills 
such as to define a problem, test a hypothesis, etc., and attitudes of open- 
mindedness, intellectual honesty, and suspended judgment. Before attempting 
to measure these or other dimensions, they should be checked against the five 
conditions established for measurable dimensions in Chapter 2. 

In common use as evaluative standards are the objectives of instruction 
themselves, range and mean of class achievement, and the teacher’s general 
feelings about the value of any pupil’s performance. Preferable to these is a 
performance scale or hierarchy containing several levels of performance, from 
that which is unacceptable or barely acceptable to that which is exemplary for 
a given grade of students. Such a standard for evaluating elementary pupils’ 
ability to formulate an experiment that will answer a given question might 
consist of three levels: plans an inappropriate experiment; suggests an experi- 
ment that points toward the answer but is unrealistic in its arrangements or 
fails to control all the essential variables; proposes an experiment that should 
give a clear-cut answer to the question. 
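
A performance scale of this kind is an ordered set of levels, each with a behavioral description. As a minimal sketch, assuming Python as a recording tool, the three levels just described might be held as follows; the class name, level names, and pupil labels are hypothetical and are not drawn from the text.

```python
from enum import IntEnum


class ExperimentPlanLevel(IntEnum):
    """Three-level performance scale, ordered from lowest to highest."""
    INAPPROPRIATE = 1
    PARTIAL = 2
    CLEAR_CUT = 3


LEVEL_DESCRIPTIONS = {
    ExperimentPlanLevel.INAPPROPRIATE: "plans an inappropriate experiment",
    ExperimentPlanLevel.PARTIAL: (
        "suggests an experiment that points toward the answer but is "
        "unrealistic or fails to control all the essential variables"
    ),
    ExperimentPlanLevel.CLEAR_CUT: (
        "proposes an experiment that should give a clear-cut answer"
    ),
}


def describe(level):
    """Return the behavioral description attached to a level."""
    return LEVEL_DESCRIPTIONS[level]


# Hypothetical judgments recorded by a teacher for three pupils.
judgments = {
    "Pupil A": ExperimentPlanLevel.PARTIAL,
    "Pupil B": ExperimentPlanLevel.CLEAR_CUT,
    "Pupil C": ExperimentPlanLevel.INAPPROPRIATE,
}

# Because the levels are ordered, pupils can be listed from highest to lowest.
ordered = sorted(judgments, key=judgments.get, reverse=True)
for pupil in ordered:
    print(pupil, "-", describe(judgments[pupil]))
```

The levels carry order only; nothing in the sketch assumes that the step from the first level to the second equals the step from the second to the third.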

As in other areas, forms of measurement should be used that are appropriate
to the dimensions and the evaluative standards to be applied to them.
In science this makes classification and rank symbols the primary forms of
measurement. All types of behavioral measuring procedures are applicable to
science courses. Guided response items have been devised to appraise dimensions
other than the facts-known one. Observation is the most valid procedure
for appraising the problem solving ability of students, their sensitivity to their
environment, and their proficiency in the laboratory. Many well-designed
standardized tests are available for use by teachers but they should be used
in keeping with certain requirements. Test dimensions and coverage should be
the same as the dimensions and coverage in the course in question. Students
in the class should be like those on whom the test was standardized and
technical aspects of the test should reflect prevailing best practice.

Historically, mathematics has been taught as a body of manipulative
operations and concepts. Today, however, its significance as an idealized
language is being realized and mathematics instruction is more broadly conceived
than heretofore. Its four basic dimensions are computational and manipulative
techniques, concepts and principles involved in quantitative and spatial
relationships, logical structure, and application to mathematical problems.

The commonly used standards for evaluating achievement in mathematics
are arbitrary percentages of problems worked correctly, the mean of any
class’s performance, and mastery of a given set of propositions or operations.
Use of a performance scale as a standard is considered superior to any of
these. For manipulative techniques, a scale based on types of errors is
appropriate; and for concepts and principles, one involving the degree to which a
given concept may be adapted to more subtle problems has promise.

All forms of measurement, including scaling for certain tests of computational
skill, are applicable to measurement. The principal procedures of
measurement employed are guided response tests and analyses of pupil
assignments. Illustrations are given for types of items appropriate to arithmetic,
algebra, and geometry. Many standardized tests are published for measuring
mathematics achievement. Their usefulness is limited by their generalized
coverage and their emphasis on computational and operational skills. There
is a tendency for more recently published tests to deal with meaning and the
logic of mathematics as well as with manipulations.


EXERCISES

1. Develop an evaluative scheme for evaluating knowledge of important
scientific facts. What basic dimensions would you have to consider?

2. Examine three standardized tests in science or in mathematics and make
a critical analysis of them in terms of what dimensions are being measured.

3. Select a specific dimension which would be most likely to occur in your
science or mathematics teaching. Develop an evaluative scheme in terms of levels
of performance and then construct a few questions appropriate for each of the
levels.

4. In what ways are the measurement and evaluation problems in science
and in mathematics instruction alike and different?

5. Why have the dimensions of computational skills in arithmetic been so
carefully and systematically analyzed?

6. Select a word-problem in a mathematics text and develop an evaluative
scheme you would use to evaluate the answers of pupils to the problem.

7. Develop an observation form that you believe would measure pupils’
scientific attitudes in your class.




CHAPTER 13 


PERFORMANCE-ACTIVITY AREAS 


If we were to walk through a school today, we would witness a great
variety of activity. In the play yards, athletic fields, and gymnasiums we would
watch boys and girls playing games and learning athletic skills. In the shops
we would see students busy with the projects in woodworking, auto mechanics,
electricity, and sheet metal. In art and music rooms pupils would be painting,
drawing, modeling, singing, and playing instruments. Girls in home economics
classes would be learning how to prepare meals, to sew, and to care for a
family. And so on. From our trip we would have to conclude that a sizable portion
of the school’s curriculum is devoted to the development of various
manipulative and motor skills and that the schools today are “activity conscious.”

This chapter is devoted to the general problems of measuring and evaluating
achievement in subjects which emphasize motor performance and manipulative
activities. These subjects are art, business education, driver education,
home economics, industrial arts, music, and physical education. Moreover,
we will restrict our treatment to the performances and activities themselves
and to the products of these performances. The appraisal of knowledge and
attitudinal phases of achievement in these subjects may be performed as it is
in any of the so-called “academic” subjects (see Chapters 5, 6, 10, 11, and
12). We shall discuss in turn the measurable dimensions of performances and
products in the subjects, the measuring procedures to be used, and standards
for their evaluation. The subjects of art, music, etc., will not be treated
separately, but sufficient examples will be presented for each to show their
unique problems and techniques.

Process and Product. At the outset we need to differentiate between
two important aspects of a performance or activity, these being the process and
the product. The process in carrying out a performance refers simply to the
movements involved and their sequence. The product of the performance, on
the other hand, is the result or outcome of the process. The product may be
tangible, as in the case of woodworking, or it may be intangible, as in the
case of a musical selection played on an instrument. Whenever the product of a
performance is intangible, the process and product are so intimately interwoven
that it is difficult, if not impracticable, to separate the two. Such is the case
in singing, swimming, dancing, kicking a football, throwing a baseball, etc.


The degree of emphasis on process or product depends upon the subject 
and the objectives of the course. In the matter of painting, generally the prod- 
uct receives more attention than the physical movements involved: i.e., hold- 
ing the brush, mixing the paint, etc. As well, there are courses in which the 
process or procedure receives nearly all the emphasis. In a beginning golfing 
or archery class the emphasis may be solely upon form or procedure. In most 
cases, however, process and product seem to have equal significance. As a 
rule, the product of a performance is affected greatly by the process that
preceded it. Furthermore, the process is determined and modified largely by 
the desired product in mind. 

Measurable Dimensions of a Performance in Progress 

We shall consider now some of the general dimensions of any performance 
which we can reasonably expect to measure. The discussion in Chapter 2 on 
the essential characteristics of measurable dimensions is pertinent to the 
problem and it might be well to review them (pages 19-26). In brief, a 
measurable dimension is relevant to a class of things, manifests variation,
provides sensory data, is clearly defined, and produces consensus of reaction 
among impartial observers. 

1. Speed. One of the more obvious aspects of a performance is its speed.
This is very simply measured by the time it takes a pupil to complete the
performance. Comparisons may be made with other timed performances in
order to get some idea of how fast relatively this one is. Speed is an essential
factor in most competitive performances and in such others as typing and
shorthand.

2. Accuracy. Another important and common dimension of a performance
is accuracy. As a matter of fact, we hear the “speed and accuracy” of
performance mentioned more than any other aspects. Accuracy is commonly
measured in terms of error counts. This presupposes, of course, that there
exists a rather detailed concept of how the performance should be carried out
ideally, and any deviation from this ideal is to be considered an error. Several
types of errors can be identified, and consequently accuracy is a very broad
dimension. Precise measurement usually requires that accuracy be broken
down into some of its subdimensions, two of which are listed below.

a. Procedural errors. This type of error requires the existence of only one
correct sequence of steps or only one pattern to follow in carrying out the
performance. It would mean that the procedure has been formalized or
standardized by society in general or by special groups. Usually the given
procedure has been established on the basis of experience and general acceptance,
but in any event a deviation from the procedure would be counted as an error.
This would apply particularly to performances governed by generally accepted
rules of etiquette, as in serving meals or in dancing, or by legal procedures, as
in driving an automobile.

b. Errors in following instructions. These errors would occur when the
performance is made in response to instructions. Deviations from the given
directions would indicate a degree of noncompliance with the requirements
of the task. Errors of this type may be due to actual inability to carry out the
instructions or they may be due to inadvertent mistakes. Typing errors are a
good example of the latter.

3. Discrimination. This dimension involves the selection or choice of
tools, equipment, and movements used in carrying out the performance and
the perception of stimuli that accompany the performance. Measurement of
the dimension is done in terms of adequacy and effectiveness for the operation
performed. Discrimination is an important dimension in woodworking,
electricity, auto mechanics, and in other crafts where several tools and pieces
of equipment are used and where the thing being made or repaired must be
perceived carefully and accurately. It is also important in music, where sounds
must be heard aright and where tones and finger movements must be selected
with exactness.

4. Economy of effort. Another important dimension in performance is
the economy of effort involved. Here we look for the amount of effective
motion as against the amount of “lost motion” or trial-and-error behavior. This
aspect of performance is closely related to speed in that the more economy of
effort there is, the greater will be the speed of the performance. There is also
an element of discrimination involved whenever a choice of movement is
made. In most instances, however, the dimension of economy of effort is a
matter of well-trained and co-ordinated muscular movement that makes the
performance seem easy and effortless. This attribute is particularly desirable in
dancing. It is always an essential element in any activity that requires a great
deal of muscular co-ordination, such as swimming, playing football or
basketball, and gymnastics.

5. Timing. This dimension has to do with the rate and emphasis of movement
in a complex motor performance. In the operation of a piece of machinery
such as a lathe, a crane, or a bulldozer, the several levers and wheels
must be operated at exactly the right time and to the right extent. The
dimension of “timing” is also involved in any team play, in gymnastics, and, of
course, in music and dancing.

6. Intensity. Another component of a performance to be considered is
the intensity of the action. The outward manifestations of this dimension
would be the force or amplitude of the movements involved. Intensity is of
particular importance in performances involving strokes, such as tennis, golf,
handball, or baseball. The desirable degree of intensity is dependent upon the
requirements of the task at hand. It is possible that there may be too much
force exhibited or too little, and measurement will have to be in terms of
deviation in either direction from some optimum degree of intensity.

7. Coherency. This dimension applies to performances in which there is
no single correct procedure or sequence of steps for carrying out the tasks
involved. In such a case it would be impossible to measure each performance
step in terms of adherence to an ideal procedure. Consequently, actions must
be judged on the basis of their internal consistency or their mutual
appropriateness. Measurement of coherency, then, is a matter of the degree to
which one step logically follows from a previous step or logically leads to the
succeeding step. This dimension is important for woodworking and other
shop-type activities involving many individual steps, for which there is no
single correct sequence.
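
Two of the dimensions above lend themselves to simple arithmetic: accuracy, measured by error counts within each subdimension, and intensity, measured as deviation in either direction from an optimum. The following is a minimal sketch of such record keeping in Python. The function names, the sample observations, and the 1-to-9 rating scale are all illustrative assumptions rather than anything prescribed in the text.

```python
from collections import Counter


def tally_errors(observed_errors):
    """Count observed errors under each accuracy subdimension."""
    return Counter(subdimension for subdimension, _note in observed_errors)


def intensity_deviation(observed, optimum):
    """Signed deviation: positive means too much force, negative too little."""
    return observed - optimum


# Hypothetical observation record for one driving performance.
observed_errors = [
    ("procedural", "turned without signalling"),
    ("procedural", "did not stop fully at the stop sign"),
    ("instructions", "parked on the wrong side of the street"),
]
error_counts = tally_errors(observed_errors)
total_errors = sum(error_counts.values())

# Hypothetical intensity ratings on an arbitrary 1-to-9 scale, optimum 5.
deviations = [intensity_deviation(rating, 5) for rating in (5, 8, 3)]

print(dict(error_counts), total_errors, deviations)
```

Keeping the counts separate by subdimension preserves the distinction between procedural errors and errors in following instructions, while the signed deviation shows whether a stroke was too forceful or not forceful enough.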

Measurable Dimensions of Products of Performances 

We need now to consider briefly the aspects of products that lend themselves
to measurement. In art are produced drawings, paintings, sculptures,
and mobiles. In shop, students make shelves, ash trays, and mail boxes.
Dresses and foods are produced in homemaking, and in business education
there are typescripts and shorthand notes. Such products have basic physical
dimensions (weight, size, color, etc.), which may be measured by the
conventional techniques of physical measurement. Of greater interest to the
teacher of a performance subject, however, are the dimensions of these products
that have to do with their significance or purpose. Among these dimensions
are the “composition” of a painting, the “flavor” of a cake, and the
“evenness of stroke” in typing. Often the dimensions of most importance to a
teacher are the details that have been specified for a product by a pattern,
recipe, or set of instructions.

There is far less commonality among the dimensions of products than
among the dimensions of performances. Consequently, no effort will be made
to define any general dimensions for them as we did for performances in
various subjects. The only dimension which approaches general significance
is that of accuracy, and even this may not be relevant to certain art products.

Attempts often are made to measure the aesthetic attributes of certain
products. This is particularly true in art and music, and often true of products
in home economics. The measurement of aesthetic properties is much more
complex and difficult than the measurement of physical properties. This is
due, of course, to the fact that aesthetic properties often fail to satisfy the
conditions essential for measurability (see pages 20-24). If it is necessary
to measure any aesthetic dimensions of a product, the measurer must define
these dimensions in such a way that they are measurable. For example,
“composition” may be an aesthetic dimension of a painting. Before “composition”
may be measured other than subjectively, it has to be defined in terms of line
convergence, focal point, balance, etc. Such definition of aesthetic properties
in measurable terms is illustrated later in the chapter in connection with the
design of measuring procedures (see page 345).

Dimensions of Performances and Products in Specific Subjects 

So far our discussion of the dimensions of a performance and its products
has been somewhat general. We need now to view some of the specific dimensions
that teachers try to evaluate in subjects where activity predominates. To
attempt an exhaustive listing of all the specific dimensions of performance in
each course would be a vain undertaking since their number is legion.
Consequently, we shall present only some of the more common ones to be
encountered in each subject. These should provide a starting point for measurement
in the subjects and, in addition, they should suggest to the teacher other
dimensions in which he will be interested.

The sources of the dimensions to be cited are textbooks for the subjects,
teachers’ guides and methods texts, courses of study, and the few standardized
tests published for performance-activity subjects. The dimensions are listed
in outline form for each subject. Some are actually subdimensions of others
and you may notice that many are merely specific instances or applications
of the general dimensions we have been discussing. Furthermore, some of the
dimensions, as stated, do not meet the basic conditions of measurability and
will have to be redefined in behavioral terms.

Art. First we shall list the specific dimensions commonly mentioned in
connection with the products of art. The appropriateness of any one will
naturally depend upon the medium used.

Art Products

Organization or composition
Balance
Rhythm
Color
Contrast
Repetition
Sympathy for subjects
Line quality
Tone quality
Spatial relations
Accuracy of proportion or suitability of distortion
Stability of subjects
Ease of interpretation
Suitability of medium for purpose
Relationship of proportions
Textural interest
Kinesthetic interest
Technical facility
Expressiveness and originality

Artistic Behavior

Perceptiveness to art in everyday living
Ability to enjoy art spontaneously
Communication of ideas clearly in artistic expressions
Ability to criticize and to profit by criticism of art expressions
Ability to organize artistic forms for certain purposes
Degree of independence and originality in art expressions
Application of art principles to activities outside of art class





Music. Several of the dimensions cited for art are equally pertinent to
music. These are rhythm, sympathy (for selection), tone quality, ease of
interpretation, technical facility, and all those given for “artistic behavior.” In
addition, there are a number of other important dimensions.

First, much research has been done in studying discrimination in music.
Two standardized tests in particular attempt to measure this dimension,
namely, the Seashore Measures of Musical Talent and the Kwalwasser-Dykema
Music Tests.¹ In the Seashore test, the student is required to make
discriminations for each of the following subdimensions:

Pitch          Timbre
Loudness       Rhythm
Time           Tonal memory


The Kwalwasser-Dykema test measures the following dimensions, which include
many types of discrimination:

Tonal memory                   Rhythm discrimination
Quality discrimination         Pitch discrimination
Intensity discrimination       Melodic taste
Feeling for tonal movement     Pitch imagery
Time discrimination            Rhythm imagery


Subdimensions of accuracy are important in music and the dimensions
contained in an early test, the Hillbrand Sight-Singing Test, serve to illustrate
them.² In this test, the pupil briefly studies each song presented and then sings
without accompaniment. The following types of errors are noted:

1. Notes wrongly pitched

2. Transpositions

3. Times flatted

4. Times sharped

5. Notes omitted

6. Errors in time

7. Extra notes


Our final group of specific dimensions in music concerns some of the aspects
of instrumental performance. The exact appropriateness of each dimension,
of course, depends upon the instrument being played.

Tone: Beauty and quality, intonation, fluency, and modulation.

Technical proficiency: Fingering, precision, intervals, breath support, tonguing,
attack and release of tone, accuracy of notes and rhythm.

Interpretation or musicianship: Phrasing, tempo, dynamics (shadings from
soft to loud), rhythmic flow, expression, contrast, mood, naturalness, balance.

General effect: Sincerity, discipline, stage appearance, confidence, spirit,
posture.

¹ New York, Psychological Corporation, 1939, and New York, Carl Fischer, Inc.,
1930, respectively. For more complete data on standardized tests in music see
Appendix B, page 482.

² Yonkers, N.Y., World Book Company, 1923.

Industrial Arts. First, we shall indicate several of the dimensions of the
working process in a shop class, and following this will be some of the
important dimensions of shop products.

Shop performance:

Ability to follow directions
Care of materials
Skill in handling tools and equipment
Observance of safety precautions
Adaptability when difficulties arise
Ability to plan a procedure
Ability to prepare a bill of materials
Understanding of limitations and capabilities of tools and equipment

Shop products:

Correspondence of finished product to original plans
Neatness of over-all appearance
Accuracy of angular measurements
Suitability of finish
Appropriateness of materials
Fit of joints
Accuracy of dimensions

Business Education. Performance in typing, shorthand, business machines,
and filing (the activity aspects of business education) is measured
largely in terms of the two primary dimensions of nearly any performance:
speed and accuracy. Among the few special dimensions involved are:

Typing: 

Posture
Hand position
Stroke

Paper insertion, margin setting, etc.

Care of machine 
Appearance of copy 
Adherence to proper forms 

Shorthand: 

Clarity of characters 

Business machines: 

Setting up for given operation 
Checking or proofing 
Care of machine 





Home Economics. The dimensions of performance in home economics
have to do with all the various subdivisions of the subject: home management
and family relationships, foods and nutrition, and clothing and textiles. Among
them are the following:

Home management and family relations: 

Planning family responsibilities 
Budgeting 

Investment and financial planning 
Purchase of furniture and equipment 
Care of furniture and equipment 
Planning interiors 
Decorating 
Child care 

Foods and nutrition: 

Meal planning 
Preparing and serving meals 
Purchasing and storing food 
Applying principles of nutrition

Clothing and textiles: 

Selection of clothing and household fabrics 

Making, repairing, and altering clothes (many detailed dimensions here)
Laundering 

Use of sewing machine 

Choice and use of textures, color, and style in fabrics 

Products in Home Economics are largely restricted to food prepared and
garments made. Among the commonly measured dimensions for foods and
clothing are:

Food preparations:

Appearance               Texture
Consistency              Taste
Flavor                   Tenderness
Color                    Moisture content
Lightness                Odor
Greasiness               Arrangement
Size of serving

Garments:

Attractiveness: Color and texture
Choice of trim
Pressing and cleanliness
Style
Individuality
Suitability of style and fabric used

Workmanship: General construction
Seams
Fitting details (gathers, darts, pleats, tucks)
Finishing details (collars, cuffs, waistbands, zippers, hems)
Wearability

Physical Education. Performance in physical education involves general 
physical condition, general physical performance, fundamental athletic skills, 
and skill in specific games and sports. 

Physical Condition. Various anthropometric measurements play an
important part in physical education instruction. These include measures of
such dimensions as height, weight, chest, stature, vital capacity, and respiration.
These measures are often combined to establish various indexes such
as build index, vital index, and ponderal index. These indexes, together with
measures of motor performance, are used in the classification of students for
equality of competition.
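
As an illustration of how two such measures may be combined into an index, the sketch below computes a ponderal index in Python. The text does not give a formula; the one used here, height divided by the cube root of weight, is one definition found in the literature and is an assumption, as are the function name and the sample pupils. Published classification schemes differ, so the actual formulas and cut-off values should be taken from the scheme being used.

```python
def ponderal_index(height, weight):
    """Height divided by the cube root of weight (assumed definition)."""
    if height <= 0 or weight <= 0:
        raise ValueError("height and weight must be positive")
    return height / weight ** (1.0 / 3.0)


# Hypothetical pupils: (height in inches, weight in pounds).
pupils = {
    "Pupil A": (60.0, 64.0),
    "Pupil B": (62.0, 125.0),
    "Pupil C": (58.0, 91.0),
}

indexes = {name: ponderal_index(h, w) for name, (h, w) in pupils.items()}

# Listing pupils in order of index is a first step toward grouping them.
ordered = sorted(indexes, key=indexes.get)
for name in ordered:
    print(name, round(indexes[name], 2))
```

Because the index is a ratio, it is meaningful only when every pupil is measured in the same units.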

Physical Performance. Various aspects of physical performance are as
important as physical condition in determining a pupil’s achievement in physical
education. Some of the more important dimensions of physical performance
are:


Motor educability (the ability to learn      Form
new motor skills)                            Adaptability
Speed                                        Co-ordination
Endurance                                    Rhythm
Balance                                      Strength
Flexibility                                  Agility
Power


Athletic Skills. Certain fundamental athletic skills are believed to be 
basic to nearly all sports and athletic performances (exclusive of swimming). 
Among those which have been isolated are the following: 


Running 

Throwing 

Kicking 

Bulling 

Hand-eye co-ordination with an
implement or bat
Moving quickly while carrying an 
object 


Jumping 

Catching 

Pushing 

Dodging 

Strength of torso muscles 
Balance 

Ability to get over an obstacle
Control of body in air or while hang- 
ing by arms 


Sports Skills. Because of the great variety of sports, it is impossible in
this text to list even a few of the special skills involved in each. Many
instructional manuals have been published for the sports and games practiced in
the schools and these indicate their specific dimensions. Among them are
found such unique things as stroking out of a trap (golf), free throws
(basketball), and sliding (baseball).

This completes our enumeration of some of the specific dimensions in- 
volved in activity subjects. The identification of appropriate dimensions is 
the first step in any measurement process and the above lists, together with 
the discussion of general dimensions, should indicate the scope of the task. 

Forms and Procedures for Measuring Performances and Products 

Now that we have determined to some extent what is to be measured in
the activity subjects, we must consider the form in which they may be measured
and the procedures appropriate for measuring them. You may recall that
three forms of symbolic expression may be used to characterize the status of
a dimension. These are description-classification, ranking, and scaling. Both
classificatory symbols and rank symbols may be assigned to any of the
dimensions of performances and products. Description has particular use for
such complex general dimensions as discrimination, timing, and coherency,
and for many of the special ones: for example, composition and sympathy
for subjects (art), interpretation (music), correspondence of product to plans
(shop), budgeting (home economics), and co-ordination (physical education).

Because both performances and products have definite physical qualities,
it is possible to use scale measurement somewhat more for them than for the
dimensions of knowledge and understanding involved in other subjects. Speed
and rate may, of course, be measured by time scales and such physical
dimensions of products as are important may be measured in terms of feet,
pounds, and color spectrums. In addition, error counts (accuracy) may be
considered scale numbers if the errors may be assumed to have equal
importance. Scale measurement is, of course, widely employed in physical
education for the dimensions of physical condition and for some of the dimensions
of physical performance and athletic skills, e.g., lifting, running, jumping,
throwing, and climbing. The applicability of scale measurement to sports is
dependent upon the nature of the sport.

Observation and Product Analysis. Of the types of measuring procedure
discussed in Section 1, observation seems to be the most appropriate for
measuring performance, and product analysis is, of course, the natural
procedure for use with products. The nature and proper use of observation and
reliable ways of appraising products are discussed thoroughly in Chapters 4
and 5. The principles developed there are completely applicable to the subjects
now in question, and, moreover, many of the examples in these chapters have
to do with art, industrial arts, physical education, etc. Consequently, it is
thought unnecessary to repeat the discussion here. Instead, we should like to
deal with a very specific technique of measurement that is especially relevant
to the subjects we are now discussing. This is the “performance test.”





Much of evaluation in the performance subjects is and should be based on
observation of students at their regular work and on measurement of the
things they produce in the ordinary course of instruction. If this is the only
basis for evaluation, however, there is much room for error. Measures of
different pupils are not always comparable, some aspects of achievement will
unwittingly be stressed more than others, and some pupils will receive more
attention than others. To supplement this evaluation through incidental
observation, and to replace it at times, the use of a “performance test” is
advisable.

THE PERFORMANCE TEST

A “performance test” is no more than a special instance of observation
or product analysis planned so that given dimensions may be measured under
given circumstances. Some task is specified that will require the pupil to
engage in essential operations. The teacher watches him closely or scans his
product closely and records some measure of his performance. This is usually
in the form of a rating on some classification scheme, but it may be an
indication of time, an error count, or simply a description of the performance.
The advantage of a “performance test” over observation of ordinary
performances and appraisal of ordinary products is that it yields measures more
comparable from pupil to pupil and it permits the measurement of desired
dimensions under controlled circumstances.

There are at least three basic types of “performance tests.”* The student
may be required to:

1. Identify or recognize the proper procedure or the proper tools or parts
in carrying out the performance.

2. Carry out the performance under simulated conditions or in miniature.

3. Carry out a single task that is typical of the over-all performance from
which it was drawn.

Recognition Tests. In the recognition type of performance test, the
student is confronted with a task given either orally or in writing and is asked
to identify the proper procedure or the correct tool or piece of equipment to
be used in performing it. In illustration, a student in a woodworking class
might be asked to imagine himself faced with the task of cutting a groove of
given dimensions in a piece of wood. He would then be asked to tell what tools
he would use and how he would proceed. Another example would be in an
electrical shop where the student is asked to identify the proper electrical wire
splice from among several splices.

Since the student is not asked to carry out the actual performance, this
type of test is at best an indirect procedure for measuring performance. The
recognition procedure is useful only in performances where several alternatives
for tools and for procedures exist and where the actual manipulative
movements are routine matters that everyone can do. The recognition type of
performance test is particularly useful, then, for activities where choice or
discrimination is the crucial dimension and where other aspects are relatively
unimportant.

* These three types of performance tests follow the suggested classification presented
by Ryans and Frederiksen (14:457-461).

The Simulated Performance. The second type of performance test is the
simulated performance. Here the student is asked essentially to “go through
the motions” of a performance without actually carrying it out. Often this sort
of test is used for measuring performances that involve expensive or
inaccessible equipment. The armed forces, in particular, have made liberal use
of the type. For instance, to observe the actual performance of a bombardier
would require the use of an expensive bomber and the time of several crew
members, to say nothing of gasoline, maintenance, etc. Consequently, the Air
Force has developed mock equipment on which the bombardier simulates an
actual bombing operation, and his performance is measured in this simulated
situation.†

Basically, the simulated test requires the selection of the essential activities
involved in a performance and then the provision of means for duplicating
or simulating these activities where they may be easily observed. The
effectiveness of the test is heavily dependent upon the degree to which the actual
operation is simulated. Even at best, however, the simulated test contains some
element of artificiality. It is likely that this test will always be used to some
extent whenever expensive machinery, time, convenience, and safety are
overriding considerations.

In schools, the use of the simulated test is not extensive, largely because
performances involving expensive equipment are not a usual part of a school
curriculum. The simulated situation is employed mostly in subjects where it
is difficult to observe a performance closely as it actually occurs. Examples
of this are afforded by “shadow boxing” and swimming out of water. In
shadow boxing, the boxer contests an unseen opponent, thus allowing the
instructor to observe footwork, use of arms, head, and body, and general form
without the distraction of blows and the opponent. Likewise, if a swimmer lies
on a low stool on his abdomen and then simulates a swimming stroke, the
instructor can observe his co-ordination of arm, leg, and head movement
much more exactly than he can when the student is in the water.

The Work Sample Test. The third type of performance test is called a
“work sample” test. Here the student is asked to carry out a task typical of
the over-all performance from which it was drawn. Of wide application and
an extremely flexible procedure, this type of test has much to commend it
for use. The sample task may be carried out under actual conditions, hence
the performer is more inclined to feel that his skill is being tested under
realistic circumstances. If the sample has been carefully selected, the test may be
a valid indicator of a student’s ability to perform the activity as a whole.

† A model representing the earth's terrain in the vicinity of the bombing area is
constructed. A radar screen using sonic waves instead of electrical waves depicts the earth's
terrain as it moves over the model. Instruments simulate the movements of the plane
and the bombardier uses this information to determine the bomb release point.

The “work sample” is the type of performance test most commonly used
in the schools. Therefore, the remainder of our discussion will deal with the
use of work sample tests in the various performance subjects. Although such
tests often are used to determine aptitude for certain activities, we shall
concern ourselves exclusively with achievement. First, we shall outline some
general principles that govern the construction of this type of performance test.
Following this, we shall offer some examples of the tests as applied to specific
activity courses.

Construction of Performance Tests of the Work Sample Type 

The usual steps in constructing a performance test are to:

1. List all the specific activities in the performance that the test is to
measure.

2. Select the activities that are to be included in the test.

3. Develop a task or series of tasks that incorporates these activities and
manifests their dimensions.

4. Develop an observation form for measuring the activities in terms of
their important dimensions.

5. Develop instructions, directions, and an over-all plan for administering
the test.

JOB ANALYSIS

The first step is essentially a “job analysis.” Here the teacher needs to jot
down or at least to review mentally the significant activities that the
performance involves. In selecting from this parent list the activities to be tested,
the second step, the following criteria are suggested. The activities should:

1. Represent the whole performance as accurately as possible.

2. Be crucial in nature and have widespread effect on the quality of
performance.

3. Reflect the emphasis given in instruction.

4. Embody the dimensions that meet the essential conditions of
measurability.

5. Require minimal time and expense.

DESIGNING THE TASK

After the essential activities have been carefully selected, the third step
is to design a task that incorporates these activities. This is just the reverse of
what happens in an actual situation. Ordinarily, a task needs to be done and
the activities occur in response to the requirements of the task. Here,
however, we have a list of activities that we wish to observe and we develop a
task that permits their observation. Consequently, the appearance of
artificiality must be avoided and the task should be made as real as possible.
This task, designed to incorporate the representative activities, is now the
“work sample.”

PREPARING THE OBSERVATION FORM

The fourth step, developing an observation form for recording measures
of the work sample, is a crucial one. Before this form is developed there needs
to be careful analysis and an isolation of the measurable dimensions of the
activities involved in the work sample. The list of possible dimensions of
performance suggested earlier in this chapter should be helpful in this analysis.
For these dimensions or aspects to be measurable, they should be clearly
defined and manifest easily observable variations. The observation form is no
more than a list of the dimensions to be measured and a formal provision for
recording variation relative to these dimensions.

Most performances involve stages or phases, which are their elemental
dimensions. This is true in such performances as cooking, sewing, metal work,
and mechanical drawing. In cooking, the stages might be the assembling of
ingredients and equipment, the mixing of the ingredients, and the cooking. In
mechanical drawing many phases may be identified, some of which are the
planning stage, including freehand sketch and layout plan, the execution stage
in which a pencil drawing is made, dimensioning and labeling, and finally
inking. The observation forms for such performances should contain these
stages or phases (the elemental dimensions). In addition, they must provide
for an indication of the presence or absence or status of these elements.
Usually the status of the elements is indicated with a reference to given
properties or subdimensions of the elements, such as speed, accuracy, etc. In such
case the observation form should provide a scheme for recording variation
for each of these subdimensions.

Observation forms are classified according to the type of variation they
record and the manner of recording it. The three basic types are check lists,
descriptive or graphic rating scales, and anecdotal forms. Each type may be
found in many different forms, depending upon the dimensions being
measured and the nature of the activity being observed. The characteristics and
uses of the three types are discussed fully in Chapter 4.

Check Lists. The check list form is particularly useful for measuring
the accuracy aspect of performance. For instance, a teacher may prepare a
list of procedural steps that should be followed. Each step is provided with
a “yes-no” response on whether or not the step was satisfactorily performed.
Accuracy of procedure is then directly measured by the total number of steps
satisfactorily performed, or measured inversely by the number of errors
committed. In other cases a check list is used of all the common errors of
procedure or errors in following instructions, and the fewer the errors committed,
the more accurate is the performance. If a particular step or type of error is
more important than others it should be weighted accordingly.

The following is an example of a check list type of observation form:

Check List for Pattern Cutting in Sewing

                                                              Yes    No

1. Assembles necessary material and equipment                 ___    ___

2. Lays out patterns on cloth for optimum use of cloth        ___    ___

3. Adequate pinning of patterns on cloth                      ___    ___

4. Proper use of scissors                                     ___    ___

5. Cutouts closely approximate pattern shape                  ___    ___

6. Proper allowances made in cutouts                          ___    ___
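
The scoring rule described for check lists, counting the steps satisfactorily performed and giving extra weight to the more important ones, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `checklist_score`, the shortened step labels, and the weight of 2 are hypothetical and do not come from the form itself.

```python
def checklist_score(responses, weights=None):
    """Return (points earned, points possible) for a yes-no check list.

    responses maps each step to True (yes) or False (no).
    weights maps steps to their relative importance; any step not
    listed counts as 1. The weights used below are assumptions made
    only for illustration.
    """
    weights = weights or {}
    earned = 0
    possible = 0
    for step, satisfactory in responses.items():
        w = weights.get(step, 1)
        possible += w
        if satisfactory:
            earned += w
    return earned, possible


# One pupil's record on the six-step pattern-cutting check list.
pupil = {
    "assembles material and equipment": True,
    "lays out patterns for optimum use of cloth": False,
    "pins patterns adequately": True,
    "uses scissors properly": True,
    "cutouts approximate pattern shape": True,
    "makes proper allowances": False,
}

unweighted = checklist_score(pupil)
# Hypothetical weighting: allowances judged twice as important.
weighted = checklist_score(pupil, {"makes proper allowances": 2})
```

With equal weights the pupil earns 4 of 6 possible points; doubling the weight of the missed allowance step lowers the proportion to 4 of 7, which is the effect weighting is meant to have.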

Rating Scales. The second type of observation form is one that records
several gradations of performance. In this form the “yes-no” or
occurrence-nonoccurrence responses of the check list are replaced by descriptive levels or
graphic ratings. In the check list form, just a check, a number, or the words
“yes-no” appear, and what constitutes satisfactory or unsatisfactory
performance or the occurrence or nonoccurrence of an error usually is present only
in the mind of the scorer. In the rating scale, on the other hand, a description
of what constitutes each level of performance may appear on the form for
everyone to see. This makes it possible for scorers to check one another more
closely and for the persons being scored to find out more exactly the nature
of their performance. In some “scales” these descriptions are omitted,
gradations of performance being implied and symbolized by units of a line or by
numbers in a series. This usually is a less satisfactory type than the descriptive
scale.

A brief example of a descriptive scale for performance in throwing a pass
in football is provided as follows:

Level I. Shows poor form. Ball falls short or overshoots. Not balanced and has
little self-assurance.

Level II. Body is well balanced. Opposite foot points properly and hand is in
proper place on the ball. Moves are somewhat jerky and self-conscious. Watches
receiver too long. Ball travels in a wobbly manner but manages to get to
receiver.

Level III. Body is well balanced. Opposite foot points to where ball is passed. Hand
is in proper place on the ball. Moves with ease and assurance. Doesn’t look
directly at receiver until the proper moment. Ball travels spirally and is
accurate.

In our later discussion of evaluation, it will be seen how easily the descriptive
level type of observation form, such as the one above, can be adapted to
making evaluations of performance.

Anecdotal Records. The third type of observation form is the anecdotal
form. This is the most informal type of observation, since it consists
essentially of a blank sheet of paper on which the teacher records as objectively as
possible what is observed in a performance. This type of form would be
particularly useful when the performance situation is new to the teacher and
somewhat fluid so that no clearly defined dimensions have as yet been
established. Where description is to be the form of measurement, the anecdotal
form obviously is the only appropriate recording device.

Many varieties of these three types of observation forms are either
available or can be developed. Chapter 4 provides examples of the possible
variations of rating scales. It cannot be emphasized too much that the
development of an effective observation form is a very crucial step in the construction
of a performance test.

PLAN FOR ADMINISTRATION

The fifth and final step in the construction of a performance test is to
develop instructions, directions, and an over-all plan for administering the
test. This part may be as formal or informal as the circumstance demands.
In some situations with small groups, the test may be administered informally.
A more formal situation, however, may be required for large groups. In some
cases it may be desirable to provide for recording incidental observations or
things that happened during a performance that might have bearing upon an
evaluation of the performance.

Whether the performance test is administered under formal or informal
conditions, certain requirements of good test administration should be met.
The test situation needs to be standardized as carefully as possible with respect
to time allotments or allowances, sequence, warm-up period, placement of
tools, equipment, materials, the number of trials to be allowed, etc. There
need to be clear-cut instructions so that the tasks are fully understood by the
examinees. Furthermore, no variation in these instructions should be
permitted, otherwise some examinees will be given an advantage over others.
Also directions need to be provided for the person administering the test as
to where he is to stand, how he is to behave, what he is to say, what specific
observations he is to make, how he is to record his ratings, etc.

These instructions and directions comprise the over-all operating plan 
for administering the performance test. The plan should be written out when 
the test is complex or where large groups of students or several examiners are 
involved. If it is a simple test and few students and only one teacher are
concerned, an outline or notes may suffice.

Examples of Observation Forms and Performance Tests Used 
in Specific Subjects 

To conclude our discussion of measuring procedures in the
performance-activity subjects, we want to present some examples of what teachers actually
do when they measure pupil performances and products in art, music, shop,
etc. Some of the examples are simple forms that may be used in incidental
observation as well as in performance testing. Others are performance tests
per se. Since the devices illustrated were designed by teachers having only
basic training in measurement, they are imperfect in some respects. However,
it is thought that such “realistic examples” may be of more help to the
beginner than would be idealized ones. Important errors will be indicated and
improvements suggested.

ART 

The form presented in Table 23 is typical of what might be designed by
an art teacher to measure performance in a high school freehand drawing
class where pencil and charcoal are used.

TABLE 23

Rating Chart for Art Performance
(Freehand Drawing Class)

                                              D     C     B     A

1. Drawing
   a. Accuracy of proportion or
      suitability of distortion              ___   ___   ___   ___
   b. Relationship of proportions            ___   ___   ___   ___
   c. Stability of subjects                  ___   ___   ___   ___
   d. Ease of interpretation                 ___   ___   ___   ___

2. Composition
   a. Balance                                ___   ___   ___   ___
   b. Rhythm                                 ___   ___   ___   ___
   c. Spatial relations                      ___   ___   ___   ___
   d. Textural interest                      ___   ___   ___   ___

3. Feel for Medium
   a. Line quality                           ___   ___   ___   ___
   b. Tone quality                           ___   ___   ___   ___

4. Subject Matter
   a. Interest                               ___   ___   ___   ___
   b. Arrangement                            ___   ___   ___   ___

Key to Variations

D - Drawing shows no regard for aspect being judged
C - Aspect not well utilized
B - Aspect noteworthy, but room for improvement at this grade level
A - Aspect adds materially to the excellence of the picture

The rating chart in Table 23 is applicable to the product of art
performance rather than the process. An excellent beginning has been made but the
levels of performance need to be more accurately described. Notice that
general dimensions, drawing, composition, etc., are analyzed into their
fundamental components.

DRIVING 

Table 24 illustrates one way of rating “behind-the-wheel” performance
in a driver training class. It is basically a check-list type of form, although it
permits three-valued ratings.


TABLE 24

Check List for Driving Performance

Aspect observed                             Classifications

Student posture                             YES        QUESTIONABLE      NO
Seat adjustment made                        YES        QUESTIONABLE      NO
Mirror adjustment made                      YES        QUESTIONABLE      NO
Foot position (dimmer switch and
  accelerator)                              CORRECT    PARTLY CORRECT    INCORRECT
Hand position (10 and 12 o'clock)           CORRECT    PARTLY CORRECT    INCORRECT
Posture (erect and behind wheel)            CORRECT    PARTLY CORRECT    INCORRECT

Putting automobile in motion
Releases handbrake                          YES        QUESTIONABLE      NO
Starts auto forward                         SMOOTHLY   UNEVENLY          JERKILY
Shifts gears (low to second)                QUIETLY    SOME NOISE        GRINDING
Shifts gears (second to high)               QUIETLY    SOME NOISE        GRINDING
Steering in road                            DIRECT     WEAVING           ASSISTANCE
                                                                         REQUIRED

Bringing automobile to a stop
Puts hand out and down                      PRECISE    UNDERSTANDABLE    UNIDENTIFIABLE
Slows car down                              SMOOTHLY   UNEVENLY          JERKILY
Brakes to a stop                            SMOOTHLY   UNEVENLY          JERKILY
Sets hand brake                             YES        QUESTIONABLE      NO
Parks car off pavement                      YES        QUESTIONABLE      NO

Showing consideration for others
When pulling out from curb                  YES        QUESTIONABLE      NO
When stopping                               YES        QUESTIONABLE      NO
Shows respect for rights of others          YES        QUESTIONABLE      NO
When in question as to others'
  rights, relinquishes his                  YES        QUESTIONABLE      NO
Response to other drivers' signal
  and tolerant of their errors              YES        QUESTIONABLE      NO





MUSIC 

The test described in Table 25 represents an attempt to measure
performance in playing various musical rhythm patterns and is for use with
beginning instrumental students. The test covers all the common rhythmic
patterns through eighth notes and syncopation in 2/4, 3/4, and 4/4 time.


TABLE 25

Musical Performance

Rhythm Patterns: The rhythm patterns are grouped into five levels of difficulty, as
follows:

  I. All possible combinations of [note symbols not reproduced]

 II. All possible combinations of [note symbols not reproduced]

III. All possible combinations of [note symbols not reproduced]

 IV. All possible combinations of [note symbols not reproduced]

  V. All of the above rhythms plus syncopated patterns

For example, at the first level we have the following possibilities in 4/4 time without
using the syncopated pattern:

[music notation not reproduced]

Each pattern is written on a large card, on the back of which is recorded the level
of difficulty and the pattern number as shown below.

Example:    Front: [music notation not reproduced]    Back: I-3






Administration of test: The teacher places one of the cards on the music rack,
establishes a tempo by tapping out one measure, and the student plays the pattern on
his instrument at some convenient pitch, repeating the pattern several times. Each
performance is rated according to the following variation scheme:

Rating    Description of Performance

1         Plays correctly on first attempt without hesitation.

2         Plays correctly on first attempt, but after first analyzing or thinking it through.

3         Plays correctly on second attempt.

4         Plays correctly after some stumbling and fumbling.

5         Is unable to play the pattern correctly.

The teacher records each performance by identifying the pattern number and the
rating for its performance, as follows:

Pattern Number    I-4

Rating            ___

This performance test attempts to measure two dimensions
simultaneously: first, the level of complexity of the pattern and, second, how quickly
the pattern was played accurately. Since the dimensions have been clearly
identified and their variations carefully defined, the ratings on this
performance test should be reliable. Because it is to be individually administered, the
test would probably consume considerable time unless some sampling scheme
were used.
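
One way to picture the sampling scheme mentioned above is to draw a few cards at random from each level of difficulty instead of presenting every card to every pupil. The sketch below is an assumption about how that might be done, not the procedure of the teacher who designed the test: the function name `sample_cards`, the number of cards per level, and the card counts are all hypothetical.

```python
import random


def sample_cards(cards_by_level, per_level, seed=None):
    """Pick up to `per_level` cards at random from each difficulty level.

    cards_by_level maps a level label (e.g. "I") to a list of card
    identifiers (e.g. "I-3"). Levels are visited in the order given,
    so the sampled test still runs from easy to hard.
    """
    rng = random.Random(seed)  # a seed makes the draw repeatable
    chosen = []
    for level, cards in cards_by_level.items():
        k = min(per_level, len(cards))
        chosen.extend(rng.sample(cards, k))
    return chosen


# Hypothetical deck: the real number of cards per level is not stated.
deck = {
    "I": ["I-1", "I-2", "I-3", "I-4"],
    "II": ["II-1", "II-2", "II-3"],
    "III": ["III-1", "III-2"],
}

test_cards = sample_cards(deck, per_level=2, seed=1)
```

Each pupil then plays only the sampled cards, and the teacher records a rating of 1 to 5 against each card number as before. Using the same seed for every pupil keeps the sample identical from pupil to pupil, which preserves comparability.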

INDUSTRIAL ARTS 

In Table 26 is an observation form that may be used to rate a shop
performance, applying an oil varnish. This form illustrates the measurement of
a performance in which detailed procedural steps are given the most attention.
Since some of the steps may be more crucial to the final result than others,
these should be given additional weights if an over-all score is to be given
to the student’s work. An unsatisfactory aspect of the form is that it provides
for immediate evaluation of procedures rather than their measurement and
yet does not define what is a satisfactory as against an unsatisfactory
condition.

TYPING 

The plan in Table 27 is used to rate a student’s ability to set up an out- 
line in proper form. Students are given copy and told to type it according to 
specific written directions. 






TABLE 26

Rating Form for Applying Varnish

                                                        Unsatis-   Partly satis-   Satis-
Action observed                                         factory    factory         factory

I. Preparing surface
   1. Checks dryness
   2. Removes dust, using suitable cloth
   3. Removes grease or wax

II. Getting the varnish ready
   1. Pours only enough varnish for job
   2. Does not pour varnish back into can
   3. Checks varnish flow and takes corrective steps
      if necessary

III. Applying varnish to wood
   1. Checks room temperature and ventilation
   2. Sprinkles floor to lay dust
   3. Checks clothing for dust
   4. Selects brush of suitable size
   5. Checks brush for cleanness and loose bristles
   6. Dips brush into varnish about [fraction illegible] the length
      of the bristles
   7. Taps brush lightly inside of container
   8. Flows varnish on surface well
   9. Starts at center of surface and brushes out
      toward the edges
  10. Wipes excess varnish from the brush over the
      edge of the container
  11. Evens up the surface with light feathering
      strokes
  12. Works with grain


TABLE 27

Plan for Judging a Typing Performance

1. Neatness: 20 points
   Points will be subtracted if the following requirements are not met:
   a. Even touch to stroking of keys to avoid light and dark letters
   b. No strike-overs
   c. Clean appearance of paper with no smudges, finger marks, creases, etc.
   d. Good general appearance

2. Accuracy: 50 points
   a. Typographical errors will be penalized on the basis of two points off for every
      error
   b. Total points subtracted for typing errors shall not exceed 50

3. Directions Followed: 20 points
   a. Use of plain 8 1/2" x 11" paper
   b. Right number of spaces for top margin
   c. Correct placement and capitalization of heading
   d. Proper number of spaces between heading and first line as well as correct
      spacing between lines for the remainder of the outline
   e. Indentations properly placed in five spaces
   f. Correct number of spaces after all periods
   g. No strike-overs
   h. No erasures

4. Completion of Task: 10 points
   a. Ten points allowed if entirely completed
   b. Adjust this part of the score commensurately with amount finished


The reliability of this type of point rating is low because of the subjective
basis for assignment of points to dimensions 1 and 3. However, the form does
show a good operational analysis of the factors in a typing performance.
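
The arithmetic of the point plan in Table 27 can be made explicit with a short Python sketch. It is offered only as an illustration: the function name `typing_score` and its parameters are hypothetical, the cap of 50 points on error deductions is an assumption (the figure is not fully legible in the table), and the deductions for neatness and directions remain the scorer's subjective judgments, which is exactly the weakness noted above.

```python
def typing_score(neatness_deduction, typing_errors,
                 directions_deduction, fraction_completed,
                 error_cap=50):
    """Total score out of 100 under the four-part typing plan.

    neatness_deduction   points the scorer removes from the 20 for neatness
    typing_errors        number of typographical errors (2 points each)
    directions_deduction points the scorer removes from the 20 for directions
    fraction_completed   share of the task finished, from 0.0 to 1.0
    error_cap            most points that errors may cost (assumed to be 50)
    """
    neatness = max(0, 20 - neatness_deduction)
    accuracy = 50 - min(error_cap, 2 * typing_errors)
    directions = max(0, 20 - directions_deduction)
    completion = round(10 * fraction_completed)
    return neatness + accuracy + directions + completion


# A finished paper with three errors and some subjective deductions.
finished = typing_score(neatness_deduction=5, typing_errors=3,
                        directions_deduction=4, fraction_completed=1.0)

# A half-finished paper with so many errors that the cap applies.
unfinished = typing_score(neatness_deduction=0, typing_errors=40,
                          directions_deduction=0, fraction_completed=0.5)
```

Only the accuracy and completion parts follow from counts; two scorers who disagree on the neatness or directions deductions will reach different totals for the same paper.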

HOME ECONOMICS

The rating outline in Table 28 is used in a home economics class and is
similar to the one for typing. The dimensions of a product are broken down
and are to be assigned points on a more or less subjective basis.

TABLE 28

Outline for Rating a Gathered Skirt

(Each of the eight groups is worth 5 points.)

1. Suitability of material to the type of skirt

2. Direction of cut suitable to the pattern of the material

3. Seam construction
   a. Seams straight
   b. Threads tied
   c. Proper width (1/2")

4. Side opening
   Zipper
   a. Lies flat
   b. Stitching straight
   c. Opens and closes easily
   Continuous lap
   a. Proper width in relation to weight of material
   b. Lies smoothly
   c. Not wrinkled

5. Gathering threads sewed in evenly and pulled up evenly

6. Belt
   a. Not stretched to one side or other
   b. Stitched evenly
   c. Buttonholes centered
   d. Buttons sewed securely, fastened neatly

7. Hem
   a. Lies smoothly
   b. Even width
   c. Stitched neatly and securely

8. Fit of skirt
   a. Waist band proper size
   b. Skirt length correct
   c. Skirt hangs evenly

As with the typing scale, the basis for assignment of points is too
indeterminate. Just how would a hem look to be rated 4 points rather than
[number illegible]? The form would be improved by inclusion of described
gradations that are worth so many points.

PHYSICAL EDUCATION

A plan for evaluating a dive is shown in Table 29. Any dive has been
broken down into five components with a description of what constitutes
good procedure for each. The dive is rated at one of five levels according to
the diver’s conformance to the prescribed procedures.


TABLE 29

Plan for Evaluating a Diving Performance

Aspects of Diving Performance

Readiness
   Balanced, erect, well poised

Approach
   Steady, easy, smooth, erect, legal number of steps

Takeoff
   Proper position on board, jump is timed with board, graceful and effective use
   of arm and leg movement, body erect

Flight
   Sufficient height for necessary movements, body movements conform to
   specifications of dive (jackknife, twist, layout, fold, turns), movements are smooth, easy,
   and graceful

Water entry
   Body perfectly straight with feet together, in approximate line with diving board,
   minimum splash and sound

Rating Scale for Over-all Diving Performance

Level   Description

1       Dive is only barely recognizable. Basic errors committed throughout. No apparent
        control.

2       Some control is apparent but dive is still inadequate. Many errors made.

3       Minimum requirements of dive are met but movements are still somewhat jerky.
        Lacks control in some respects.

4       Major requirements are met with general impression of good control and form.
        Some minor errors and deviations still exist.

5       All aspects of dive skillfully and gracefully executed with no apparent variations.

5 All aspects of dive skillfull) and gi leefiilly executed with no appirent variations 


The scale could be improved by relating each level explicitly to each of the
components of a dive. As it is, any rating is a quick and subjective “average”
of performance for each of the components.

SPEECH 

While the measurement of speech is treated in Chapter 10, speaking is
an activity subject and it affords an example of a novel and carefully worked
out performance test. The test outlined in Table 30 is designed for use in
the primary grades and is to be administered individually.

TABLE 30

A Picture Articulation Test for Nonreaders


Errors in articulation consist of the following three types:

1. Omission of sound

a. Example: says “ca” for cat
b. Record error as follows: o/t (omits t)

2. Distortion of sound

a. Example: an s that is whistled (sound is sloppy or inaccurate)
b. Record error as follows: /s (the s is distorted)

3. Substitution of sound

a. Example: says wabbit for rabbit
b. Record error as follows: w/r (substitutes w for r)

The place of the articulation error in the word mispronounced is usually indicated
as follows:

I - Initial position
M - Medial position
F - Final position





Examples: 

th/s(F) - substituted th for s in the final position of the word.

/s(I) - distorted the s sound at the beginning of the word.
o/s(M) - omitted the s sound in the middle of the word.

Procedure for administering the test:

1. Engage the child in spontaneous conversation and record articulation errors that
are noticeable in this type of speaking.

2. Show child the picture cards in sequence. Do not say the word for the child.

3. Rate the child on the rating scale of severity.

Picture Cards 

The following is a sample of the picture cards used to elicit the necessary sounds. 
Alongside each picture is the word that the picture should elicit, and underlined are the
sounds being observed. The words, of course, would not appear on the actual picture.



Rating scale of severity of misarticulation

Directions: Find the category that most nearly applies to the child’s speech and
record the category number. If two categories seem to apply, record both category
numbers and underline the one that seems to be slightly more pertinent to the case.

Example: 2-3 would mean that the child is on the borderline between No. 2 and No.
3 categories, with a slight edge given to the latter.

Category   Description of Misarticulation

1   Mild, but noticeable defect. May be considered normal unless accompanied by
    psychological or etiological factors. Error in 1 or 2 sounds.

2   Moderate disability. Noticeable deviation, but does not interfere with normal
    communication. Prognosis: good. Probably amenable to brief speech therapy.

3   Obvious articulation defect. Errors in three or more sounds. Gives evidence
    that speech disability has become habitual and will require special therapy for
    re-education. Immediate therapy needed.

4   Severe articulation problem. Immediate therapy needed. Speech is so noticeably
    deviating from normal that the child suffers psychologically in his classroom
    or at home. Prognosis may be poor or good for improvement.

5   Extremely severe defect. Immediate need for therapy. Speech indicates need
    for intensive extended therapy. Prognosis may be poor.




All the essential phases in performance testing are embodied in this test,
namely: an identification of the dimensions of the performance, development
of a task that provides for observation of the dimensions, development of a
rating scale and a recording system, and finally an over-all plan for
administration of the test. The rating scale in particular can be improved by replacing
such qualitative words as “mild,” “moderate,” “obvious,” “severe,” and
“extremely severe” by a more exact description of the actual nature of the
articulation error.
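
The recording notation used in the articulation test is regular enough to be read mechanically, which may help when many records are to be tallied. The following is a minimal sketch in Python, not part of the published test. It assumes the position letters I, M, and F, assumes that a lone "o" before the slash always marks an omission, and uses the hypothetical names `ArticulationError` and `parse_error`.

```python
import re
from dataclasses import dataclass

# Position codes assumed from the table: I = initial, M = medial, F = final.
POSITIONS = {"I": "initial", "M": "medial", "F": "final"}

# Record shape: produced/target(position), for example "th/s(F)".
_RECORD = re.compile(r"^(?P<produced>[a-z]*)/(?P<target>[a-z]+)\((?P<pos>[IMF])\)$")


@dataclass(frozen=True)
class ArticulationError:
    kind: str      # "omission", "distortion", or "substitution"
    target: str    # the sound the word calls for
    produced: str  # the sound actually made ("" when none is recorded)
    position: str  # "initial", "medial", or "final"


def parse_error(record: str) -> ArticulationError:
    """Turn one written record into a structured error (hypothetical helper)."""
    match = _RECORD.match(record.strip())
    if match is None:
        raise ValueError(f"unrecognized record: {record!r}")
    produced = match.group("produced")
    if produced == "o":        # assumption: "o" is reserved for omissions
        kind, produced = "omission", ""
    elif produced == "":       # nothing before the slash: distortion
        kind = "distortion"
    else:                      # any other sound: substitution
        kind = "substitution"
    position = POSITIONS[match.group("pos")]
    return ArticulationError(kind, match.group("target"), produced, position)
```

Counting the distinct `target` values across a child's records would give the number of sounds in error, which is the figure the severity categories refer to.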

This completes our illustrations of specific observation forms and
performance tests. It is hoped that the examples are sufficient to suggest what is
involved in their development. By now it should be clear that a great deal of
thought and experience must go into the construction of an adequate
observation form and/or performance test. Unfortunately, the work that is involved
has discouraged many teachers. However, when one considers the alternative
of depending upon subjective and unsystematic appraisals and the negative
effect the latter may have upon pupils and instructional efficiency, it seems
that any time and effort spent on developing valid observation forms and
performance tests is entirely justified.

Evaluative Standards and Problems of Reporting 

In general, evaluation of either performance or products should be
governed by the principles described in Chapter 9. There should be a comparison
between the performance or product and a known and appropriate standard
for either. The standard should not only take into account the intrinsic
character of the action or artifact but also the age of the pupils and the purpose
of the course: that is, general education, vocational preparation, recreation,
or whatever. The standard should express clearly defined levels or gradations
of quality for each of the dimensions to be evaluated. Evaluative symbols
assigned to the performance or product should be clearly related to the
standard. Periodic reports on progress that cover many performances and products
should be based on comprehensive measurement, be clear both to pupils and
to parents, be somewhat diagnostic, and reflect consistency as to standard both
among teachers and for the same teacher from year to year.

A basic consideration in evaluating achievement in such subjects as music,
home economics, etc., is that the dimensions of performances often have
immediate implications for the value of the performance. As there is increase
in such general dimensions as accuracy, speed, discrimination, economy of
effort, and coherency, there usually is an increase in the worth of the
performance. And as there is decrease with regard to these dimensions the value of
the process typically decreases. Moreover, where there are prescribed
elements or steps (the elemental dimensions) in a product or activity, their
inclusion generally adds to the value of the performance and their omission
subtracts from its worth. For example, in applying paint it is necessary first to
prepare the surface by cleaning and/or sanding. A check that this has been
done means that the painting task has been done so much the better.

For this reason an evaluative standard frequently is incorporated in the
rating scales and other observation forms used in the performance subjects
and hence evaluation and measurement are accomplished in a single act. This
was characteristic of several of the examples of observation forms we just
presented. Notice in particular the Check List for Driving Performance, the
Rating Form for Diving, and A Picture Articulation Test for Nonreaders.
Such union of measurement and evaluation is efficient and desirable as long
as it is a planned procedure and as long as the appraisal is based on
observation of the specific performance or product and not on the teacher’s general
feelings about the pupil. In evaluating, somewhat more than in measuring, it
is relatively easy for a teacher to let his over-all opinion of the pupil affect his
opinion of a given performance.

Earlier we asserted that perhaps the best type of observation and product
analysis form was a descriptive or graphic rating scale. The scale contains
brief descriptions of different grades or levels of performance. The pupil’s
performance is measured by noting which of the levels his behavior
approximates and by assigning to him the symbol that represents that level.

Such a rating scale lends itself readily to combined measurement and
evaluation. All that is required is that the value of each gradation of
performance recorded on the scale be established and a symbol used that signifies
this value. When the levels in the scale have been assigned their appropriate
values, the scale then becomes an evaluative standard as well as a measuring
device.
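
How a descriptive scale turns into an evaluative standard can be sketched in a few lines of Python. This is an illustration only: the level descriptions are abbreviated from the diving scale, while the letter values and the names `Level`, `DIVING_SCALE`, and `evaluate` are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    symbol: int       # measurement symbol recorded by the observer
    description: str  # brief description of performance at this level
    value: str        # evaluative symbol attached to the level (assumed)


# The letter values below are illustrative; a teacher would set them
# for a particular class and course.
DIVING_SCALE = (
    Level(1, "barely recognizable, no apparent control", "F"),
    Level(2, "some control, still inadequate", "D"),
    Level(3, "minimum requirements met, somewhat jerky", "C"),
    Level(4, "major requirements met, minor errors", "B"),
    Level(5, "skillfully and gracefully executed", "A"),
)


def evaluate(observed_level: int, scale=DIVING_SCALE) -> str:
    """Return the value attached to the level the observer recorded."""
    for level in scale:
        if level.symbol == observed_level:
            return level.value
    raise ValueError(f"level {observed_level} is not on the scale")
```

Measurement is the choice of `observed_level`; evaluation is the lookup of its `value`. Keeping the two in separate columns means the same descriptions could carry a different set of values for a beginning class.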

Although there may be a close affinity between the status of a
performance with respect to certain dimensions and the value of that performance,
there is no automatic correspondence. Appropriate assignment of values to
different levels of performance must take into account the age of the pupils,
the amount of instruction given them, the relationship of the given course to
earlier and later courses, etc.

There are many publications to assist the teacher in defining valid
standards for performance-activity subjects. Methods textbooks in the several
subjects contain certain standards and provide the basis for devising others.
Courses of study often contain them and many have been prepared and
published by professional associations and committees representing given subject
areas. The teacher should, of course, not depend solely on books for his
standards. He should accumulate appraisals of typical performances and
products over a period of time and continually study the capability and
motivations of his pupils.

In addition to these general considerations, evaluation in several of the
performance subjects involves special problems. Some of these are discussed
in the following paragraphs.




Art. Three factors seem to constitute the principal difficulties in art 
evaluation and these must be circumvented or reconciled in some way if art 
evaluation is to be effective. 

1. Excellence in drawing, painting, sculpture, etc., seems to necessitate 
special talent. Students in art courses differ widely in its possession and con- 
sequently any comparative grading or any use of absolute standards is certain 
to make for invidious evaluations. 

2. All their lives children have viewed the stereotyped art of magazines, 
greeting cards, comic strips, textbook illustrations, advertising, and cartoons. 
Necessarily, their efforts at drawing and painting may imitate these. Many art 
teachers, on the other hand, have been taught to avoid stereotypes and to 
seek for original expression. Consequently, some pupils who are good craftsmen
but are too much influenced by stereotypes may be judged adversely,
while some “original” spirits may be commended despite many errors in 
technique. 

3. There are several “schools” of art, each with its own philosophy, style,
media, and subject matter. Some teachers are prone to teach in terms of a
“school” and to evaluate by its standards. Pupils who are taught in successive
classes by teachers representing divergent art philosophies may receive what
they consider to be unfair evaluations in the second class.

Music. In music, the musical composition as written constitutes a sort 
of ultimate standard for any singing or instrumental performance — together, 
of course, with the accepted conventions of tone, rhythm, harmony, etc. As 
a student approximates the “music” as symbolized on the score his performance
is good, and as he departs from it the performance is bad. For such an
external standard to be used with fairness in public school music classes, it
must be tempered by the teacher’s knowledge of his students' prior training, 
their age, and their talent, etc. As in art, special talent is a variable that makes 
evaluation particularly difficult. 

The tape recorder makes it possible for any music teacher to develop
performance scales that may be used as evaluative standards. Many pupils judged
as bad to fair to excellent performers can have their singing or playing
recorded. From these many recordings a series can be selected ranging from
best to worst and each may be assigned a given value. The performance of any
pupil can be judged then by having him sing or play the same composition as
is used in the scale.

Business Subjects. Evaluation of student performance in typing, short- 
hand, machine operation, clerical practice, and bookkeeping is relatively free 
from the problem of subjectivity that besets art and, to a lesser extent, music. 
It does, though, involve two other problems. For one, business courses are 
almost necessarily vocational courses or at least have immediate vocational 
significance and thus the standards of business practice must be considered. 
For the other, in many instances low-ability high school students are assigned
to business classes simply because they cannot learn the more “academic”
subjects. Needless to say, this sort of educational guidance is of dubious value.

Employers want stenographers who can type and take dictation at given 
speeds and who can conform to certain conventions of letter composition, 
filing, telephone usage, etc. If the teacher is to be realistic, he must relate the 
standards in his class to those of the employers. 

He can do this by establishing a single level or type of performance that 
is acceptable in the business world and then by evaluating performance as 
satisfactory that achieves the level and as unsatisfactory that does not achieve 
it. Or he can devise a performance scale containing several levels of com- 
petence, one of which is acceptable for business practice and is labeled as 
such. Obviously, from our point of view, the use of the performance scale is
preferred. In beginning classes, particularly for eighth and ninth graders, the
minimum business standard might be at the top of the scale or even beyond it.
Business textbooks and teachers’ guides usually are prepared in relation to
business standards and most of their procedures conform to business practice.
Hence, if course standards are derived from textbooks and guides they are
likely to incorporate vocational standards.

The presence of low-ability students who have no interest in the subjects
per se poses a dilemma for the business teacher. From one point of view,
they are not likely to attain acceptable vocational performance levels and
hence should not be allowed to pass. From another, they have to take the
course, they certainly will learn something, and hence should be allowed to
pass. Some resolution of the dilemma may be attained by use of a dual
marking system, one standard relating to absolute achievement and the other to
learning or progress.
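
A dual marking system of this kind can be sketched as two independent judgments made from the same measurements. The Python below is a minimal illustration for typing speed; the thresholds of 40 words per minute and a 10-word gain, and the names `DualMark` and `dual_mark`, are hypothetical and would have to be set by the teacher in light of actual business standards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DualMark:
    achievement: str  # standing against the fixed (vocational) standard
    progress: str     # gain relative to the pupil's own starting point


def dual_mark(start_wpm: float, end_wpm: float,
              standard_wpm: float = 40.0, gain_wpm: float = 10.0) -> DualMark:
    """Report achievement and progress separately (thresholds are assumed)."""
    if end_wpm >= standard_wpm:
        achievement = "meets standard"
    else:
        achievement = "below standard"
    if end_wpm - start_wpm >= gain_wpm:
        progress = "satisfactory"
    else:
        progress = "unsatisfactory"
    return DualMark(achievement, progress)
```

A pupil who moves from 15 to 30 words per minute would be reported as below the business standard but satisfactory in progress, so neither fact is hidden by the other.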

Industrial Arts. Problems of evaluation in the various craft and shop
courses (drafting, wood shop, machine shop, auto mechanics, etc.) are
similar to those in business education. Instruction in this area necessarily has
vocational significance, even in general shop and in nonvocational programs,
simply because the things boys learn to do are what men do for wages. Even
more than in business courses is there a tendency for trouble-makers and
low-ability students to be assigned to industrial arts classes. As with business
education, evaluative standards for “shop” courses must be related to vocational
standards but they should also take into account the spread of achievement
likely to occur in any class. The use of two standards, one for performance
per se and the other for learning, and the assignment of marks on the basis
of either, seems to be the best approach to fair evaluation of the students who
are in a class for reasons other than interest and/or aptitude.

Physical Education. The male physical education instructor in a secondary
school is usually the coach of one or more interscholastic sports and much
of the instruction in physical education relates to these sports. This makes for
one problem in evaluation. A second source of difficulty in evaluation is the
large size of the usual physical education class and the routines of lockers,
gym dress, and showers involved.




Because of the dual role of the coach and the “varsity” significance of
many sports to be played and learned in boys’ physical education classes,
evaluative practices are susceptible to two inequities. The teacher may be
tempted to use the performance of his letter winners as the basis for grading
and thus give low marks to a large number of students. Or he may become
preoccupied with judging the skill of the better athletes, those who are team
prospects, and simply not evaluate the others, “giving” them a C for
attendance and obedience. These errors in evaluation may be minimized, if not
avoided, by the conscious and planned use of written or graphic performance
standards appropriate to the spread of ability in regular classes, not just team
classes.

With large classes, expensive equipment, special dress and sanitary
routines, a certain amount of regimented behavior is mandatory in physical
education classes. If the teacher permits too many individuals to deviate from the
prescribed order, chaos may result and teaching suffers. So the need to have
pupils conform becomes acute and one of the few “motivators” left to the
modern teacher is the report card. Consequently, many physical education
teachers reward consistent attendance, suiting-up, showering, replacement of
equipment, etc., with increased marks and punish inconsistency in these affairs
with lowered marks.

While without question this practice confuses the significance of physical
education grades, its value for class control makes it very attractive to teachers.
It is hoped, of course, that the necessary conformity can be gained without
the involvement of marks in achievement. If it cannot, and the marks are to
be affected by promptness in dressing, etc., pupils should know just how much
weight is to be given the conformance factors and this weight should not be
excessive. An evaluation of the pupil’s physical and athletic achievement per
se should be clearly indicated to him and his parents by a separate
communication if necessary. Rather than lowering or raising an achievement grade, it
seems better to have a separate mark for conformity to routines in physical
education and, if necessary, to have this separate mark bear on promotion,
honors, and even graduation.

Summary 

This chapter was devoted to the measurement of the performance aspects 
of the school curriculum that constitute the major portion of such subjects 
as art, music, shop, driver training, home economics, and physical education.
In the measurement of performance, it is helpful to identify two phases: 
process and product. Most performance processes involve in common the 
dimensions of speed, accuracy, discrimination, economy of effort, timing, 
intensity, and coherency. The measurable dimensions of performance products 
are their physical properties and such elements and qualitative attributes as 
derive from their specifications and purposes. 

The status of performances and products may be indicated by classification
and descriptive symbols, by rank numbers, and, for some dimensions, by
scale numbers. Observation and product analysis obviously are the basic
measuring procedures to be used in the performance subjects. In addition to
casual observation of processes and appraisal of products, it is well to use the
device of the performance test. Such a test, in general, calls upon the student
to perform or to simulate performing an activity under prescribed conditions.
The three basic types of performance test are: recognition, simulated task,
and work sample. In the preparation of a performance test, five basic steps
are involved: itemize the specific activities to be covered, select those activities
to be included in the test, develop a task or series of tasks that permit
observation of these activities, develop an observation form for appraising
performance and/or product, and, finally, develop an over-all plan for administering
the test. The most crucial part is, of course, the observation form. Some
illustrations of performance tests and observation forms were presented and their
shortcomings discussed.

Evaluation of performances may be accomplished automatically if an 
evaluative standard is incorporated into the rating form. This requires that 
differential values be properly assigned to the gradations of performance
described on the form. Among the problems encountered in evaluating
performance are the importance of special talent in art and music, the influence
of interscholastic competition in physical education, and the sometime use of 
business and shop classes as havens for slow or maladjusted pupils. 


EXERCISES

1. Contrast the problems of measuring and evaluating in “activity courses” 
with those of measuring and evaluating in “academic courses.” In what ways are 
the problems similar? 

2. Select a standardized performance test and critically analyze it in terms of
the dimensions being measured.

3. What would you reply to a teacher of music or art who maintains that
measurement is impossible in art or in music because performances in these areas
are so subjective?

4. Select a specific performance in your teaching area. Develop a set of
evaluative standards and construct a device for measuring the dimensions of this
performance.

5. How would you proceed systematically to improve your performance tests?

6. Suppose you were called upon to rate a performance with which you have
had little experience. How would you prepare yourself for this task?





CHAPTER 14 


INTELLIGENCE 


In the preceding four chapters we have studied the evaluation of pupil
achievement in various school subjects and the need for such evaluation was
obvious: the pupil must see how he is progressing and the teachers must see
what they are accomplishing. Now, however, we are about to be concerned
with the measurement of something that isn’t a school subject, intelligence.
And it seems appropriate to ask why we should study its measurement.

Intelligence Defined 

According to American usage as recorded in dictionaries, intelligence is
“. . . the capacity for knowledge and understanding, especially, as applied
to the handling of novel situations; the power of meeting a novel situation
successfully by adjusting one’s behavior to the total situation; . . .” ¹
Apparently, the idea of intelligence is very old and widespread. Plato in his
Republic wished people to be classified for vocations on the basis of mental
differences. His philosophers (lovers of wisdom) were to be the rulers. So far
as is known, all languages have a word for it and all ages and kinds of people
are purported to have some of it. We say, “What an intelligent child” and “the
Bushman has less intelligence than we.” Furthermore, by intelligence
apparently is meant something that the brain does or is. Words like cerebration
and gray matter, phrases like “What a brain!” tend to recur as intelligence
is discussed. Finally, we seem to mean something that changes or grows when
we speak of a child “becoming more intelligent” and when we use mental
age as an index of intelligence.

What sort of thing is intelligence, then, that it should be a power or act, be
old and universal, have to do with the brain, grow, and, particularly, have
a thousand or more definitions? In this last characteristic may be the key to its
nature. Intelligence apparently is an explanation, an explanation for the
differences we observe among people in their capacity for learning, in their
remembering, in their problem solving, etc. Since an explanation is never
seen or touched, it is easy to say different things each time we attempt the
explanation. So it is hardly surprising that intelligence is given so many varied
meanings.

¹ Webster’s New International Dictionary, Second Edition, Unabridged, Springfield,
Mass., G. & C. Merriam Company, 1952.




Intelligence is that type of an explanation that deserves the term construct.
You may recall the discussion of constructs in Chapter 2 (pages 26-28).
They are the verbal or mathematical “maps” that enable men to deal more
rationally with, and often to control, complex observable phenomena. The
complex observable phenomena in this case are a very large number of
behaviors that seem to have something in common. Examples of these behaviors
are reading, figuring, searching, writing verse, learning a new language, and
solving mechanical problems. The construct of intelligence enables us to
understand individual differences in such behaviors, permits us to appraise
their relation to one another, and allows us to predict the achievement of given
children in reading, history, chemistry, etc.

Thus, it seems that the measurement of intelligence needs to be discussed,
even though intelligence is not a subject of instruction, because it affects pupil
achievement in subjects that are taught. In our presentation, the first
consideration is the dimensions of intelligence for which measurement is
attempted; next, the customary forms and procedures for its measurement. In
connection with the latter, attention is given to the validity and reliability of
intelligence tests, to IQ variability, and to the practical uses of intelligence
testing in school practice.

Dimensions of Intelligence ²

The dimensions of intelligence are, of course, the properties or variables
imputed to the construct toward which appraisal is directed. There seem to
be three general sources of them: theories of intelligence, the nature of the
tests, and vernacular usage.

THEORETICAL VIEWPOINTS

One theorist, Thurstone, has analyzed the results of various kinds of
intelligence tests mathematically to determine what the dimensions of intelligence
are. His method is called Factor Analysis and it has yielded certain “primary
mental abilities”: spatial, perceptual, numerical, verbal relations, memory,
words, induction, reasoning, and deduction (43). Another theorist, Stoddard,
has used observation and logic as his principal tools of investigation and, from
a different point of view, offers quite different dimensions. As he puts it,
“Intelligence is the ability to undertake activities that are characterized by
(certain properties) . . .” The properties, or, as we would say, dimensions that
he asserts for these intelligent activities are: “1. difficulty, 2. complexity, 3.
abstractness, 4. economy, 5. adaptiveness to a goal, 6. social value, and 7. the
emergence of originals . . .” (40).

Somewhat contrary to the viewpoints of these two who differentiate
numerous dimensions are the opinions of two other imposing figures in the
field of intelligence testing. Spearman, an English psychologist, asserts that two
factors contribute to every intelligent behavior (38). One, a general factor
called simply g, functions in all situations. In any given situation an additional
specific factor, s, is also operative. Binet, the “father” of intelligence testing,
seemed to think that intelligence was more or less a whole or integral aspect
of the individual, although he was not given to detailed definitions (4).

² This treatment of dimensions is cursory and for purposes of general understanding
only.

DIMENSIONS INHERENT IN TESTS

The questions and problems of intelligence tests, our second source, permit
us to infer several dimensions that resemble those of theorists and others that
are of a different order. One very usual type of item in primary level tests
requires that the child differentiate among a series of animals, household
articles, geometric forms, etc. Such a task requires that the child perceive
discriminately and, hence, a dimension of discrimination is implied.

In the 1937 Stanford Revision of the Binet (Appendix B, page 478),
children are asked to look at two images for ten seconds and then to reproduce
what they saw. Again in the Binet and in the Wechsler-Bellevue (Appendix B,
page 479), testees hear a series of digits and are required to repeat them just as
they were said. Both of these tasks require that the child remember or recall
what he just perceived. Thus, recall may be inferred as a dimension that the
test measures.

A somewhat different dimension apparently is measured by the myriad
of analogies items which occur in intelligence tests. These are verbal, graphic,
and numerical. An example of the graphic variety is found in the American
Council on Education Psychological Examination, the so-called ACE:

[Figure: graphic analogy item showing figures A, B, and C followed by answer
choices 1 to 5]

(Educational Testing Service, Princeton. Reprinted by permission of Educational
Testing Service.)

Here the student is asked to complete the proposition, “C is to ___
as A is to B,” or, as the test directions suggest, “What will figure C be if you
change it by the same rule as A was changed to make B?” It is obvious (and
from your own test-taking experience you may recall) how similar verbal and
numerical analogies go. The dimension appraised in these items apparently
has to do with perceiving something in one situation and applying it to another.
Let us call the dimension generalization. Items other than analogy types seem
to measure this dimension. For example, one of the questions in the Otis Self-
Administering Tests of Mental Ability (34) is

“A lion most resembles a ___

1. dog, 2. goat, 3. cat, 4. cow, 5. horse.” ³


³ Copyright by World Book Company and reproduced by special permission.






In many tests children are asked to recognize objects that they have seen 
before and to name objects and pictures. Such items are called recognition 
and/or naming questions, and it may be that a correlative dimension of
intelligence is being measured. This dimension (let us call it recognition) is akin
to recall and discrimination. 

Problem solution is an activity found in most intelligence tests and con- 
sequently deserves designation as a dimension of intelligence to be inferred 
from the tests. Most of the “problems” are mathematical but some are verbal 
(34): “If the following words were arranged to make a good sentence, with
what letter would the second word begin? same, means, small, little, the, as.” ³
Still others are spatial. For example, in the Binet, Form L, the first test 
at age thirteen consists of verbal directions to draw the path you would take 
in searching if your wallet were lost somewhere in a round field (Appendix 
B, page 478). 

In final illustration of dimensions that may be induced from the items of 
intelligence tests, let us take notice of the hosts of items that ask what does 
this mean, pick out the word most like this one, tell me about an “orange,” 
etc. All such items require that testees give the same standard verbal responses 
to a given verbal stimulus. We call these standard verbal responses definitions,
and some of nearly all intelligence tests and nearly all of some tests ask the
testee to define words. Consequently, it seems appropriate to specify it as a
separable dimension. Definition, then, is another of the dimensions of
intelligence that we may infer from tests of intelligence.

VERNACULAR DIMENSIONS OF INTELLIGENCE

The third source of dimensions of intelligence provides items that re- 
semble those of Stoddard and other “logical” theorists somewhat more than 
they resemble the items evoked from statistical manipulations. When parents 
try to characterize more intelligent children (usually their own), they use such 
terms as “alert, he learns quickly, thinks fast, and sensible.” Educators have
been talking for many years about gifted children and how to educate them.
According to them, gifted (very intelligent) children are more critical,
creative, given to abstractions, perceptive, imaginative, etc. Dimensions of this
type are a good deal less well defined and far less mutually consistent than are 
those to be inferred from tests or stated by theorists. They have the advantage, 
however, of describing what the majority of Americans mean by intelligence, 
and any group of intellectual dimensions given general distribution must con- 
note essentially what they connote if intelligence as measured is to mean intel- 
ligence as understood. 

DIMENSIONS FOR WHICH THERE IS GENERAL CONSENSUS

From this brief review of the dimensions of intelligence offered by each of 
three sources, theories, tests, and ordinary usage of the term, it is apparent 
that there is no single, agreed-upon enumeration of dimensions for intelligence. 





While the quip, “intelligence is what intelligence tests measure,” represents an
extreme viewpoint, it is well to consider that the “intelligence” measured by 
any given test is defined by the dimensions that the test attempts to measure. 
Fortunately, most test designers essentially agree in their choice of types of 
items, and, hence, their tests appraise much the same things. Some tests, of 
course, seem to include more dimensions than others, but few are concerned
with any “unique” property.⁴

The following list is thought to subsume most of the dimensions for which
we could expect reasonable consensus among theorists, test designers, and
professional users of intelligence tests and their results:

Recall (immediate and delayed) 

Discrimination (likenesses and differences, details, patterns, relationships, etc.) 

Verbal symbolization (naming, definition, predication, etc.) 

Number symbolization (quantification, counting, arithmetic manipulations,
etc.)

Abstraction (both verbal and numerical, inductive and deductive reasoning,
classification, rule stating, etc.)

Invention (problem solving in verbal, numerical, and mechanical situations,
storytelling, free drawing, etc.)

Adaptivity (capacity to learn, to adjust, etc.) 

The dimension of adaptivity is meant to be the “general dimension,” a
need for which is cited by several theorists. If it is a general dimension, it
should be manifest in all the specific operations cited as dimensions, and it is
thought to be so. Almost by definition, and certainly in terms of test scores,
children’s growth in intelligence, their mental maturing, means that they
change in desirable directions in recall, discrimination, and the rest. Kounin,
in his investigation of rigidity-flexibility, distinguishes between a feeble-minded
adult and a normal child of the same mental age on the grounds of flexibility
(28). His “flexibility” seems synonymous with our “adaptivity.” Frank N.
Freeman, among others, has argued that a measure of capacity to learn would
be the best measure of intelligence, thus implying that learning ability (or
adaptivity) is a dimension of the general intelligence factor (20).

Moreover, a child’s success on an intelligence test means that he has 
learned certain things (names, number combinations, definitions, etc.) or can 


⁴ In this discussion of dimensions, little attention is given to the dimensions which
test publishers say they measure by their tests because the bulk of intelligence tests
(including the Binet and the Wechsler) do not represent systematic efforts to measure clearly
defined dimensions. They have been designed seemingly to differentiate between people
of presumably different intelligence and items have been included that are known to
differentiate. Obviously the authors of such tests have had a concept of intelligence in
mind, but largely they have avoided a detailed specification of dimensions except as the
labels for classes of items may be called dimensions, such as verbal or nonverbal,
analogies, vocabulary, block design, incongruities, digit span, etc.




learn something very quickly (to complete syllogisms, to solve thought prob- 
lems, to figure out mazes, and the like). Only the subtests of rote memory 
(digit and sentence recall) could be said not to involve adaptivity, yet even
with them the individual must change from no set response to a given set
response. Stoddard establishes “adaptiveness to a goal” as a dimension of
intelligence and by it seems to mean “adaptivity” as we define it. Though he 
lists it as parallel to his other dimensions, making it no more general than the 
rest, his listing is from such a point of view as to make all his factors more or 
less all pervasive. Finally, adaptivity seems to be entirely consistent with the 
correlations found between tests of such separate factors as spatial relations 
and language and it is widely used in technical and vernacular definitions of 
intelligence. 

Indexes of Intelligence 

If there is one thing most commonly “known” about intelligence, it is that 
IQ (of 100 or 110 or 90, etc.) is the index of intelligence. So well is this 
“known” that many persons find it difficult to believe that tests that do not 
yield an IQ (the ACE and the AGCT, for example; see Appendix B) do measure
the same sort of intelligence as those that do yield an IQ. Many other
persons call an intelligence test an “IQ” test. And, finally, the question “What's
his IQ?” is as commonly asked about a pupil as is the question “What’s his
intelligence?” 

Now, as a matter of fact, the IQ is only one of several forms in which
measures of intelligence are expressed. In recent years classificatory-descriptive
forms have been disregarded, but a number of different indexes of rank
and of scale position are in use in addition to the IQ.

MENTAL AGE 

For children, mental age, abbreviated MA, is perhaps the most prevalent 
index of intelligence rank. The standard individual intelligence test for children,
the Stanford-Binet (Appendix B), yields an MA, as do nearly all the
group tests in common school use. MA means simply that a given child’s
performance on a test is like the average performance on the same test of
children of a given chronological age. Thus, on the California Test of Mental
Maturity, Elementary 1950 S-Form (Appendix B), the average score of
children who are just nine years old is 54. Suppose that Harry scores 54 on 
the test; then Harry’s MA is nine years, no months. Chronologically, Harry 
may be eight years old or ten years, two months; but his Mental Age (MA) 
is 9-0 because his test score is like the average test score of children who are
9-0. 

We need to understand very clearly that MA is an index of rank and not 
of scale position. MA refers to a population of children each of average intel- 
ligence ranked in ascending order of chronological age. It tells no more than 
where to place the child tested in this ranked population. 





INTELLIGENCE QUOTIENT 

The IQ, on the other hand, often does denote scale position. Except for
certain adult and preschool tests, intelligence tests invariably are designed to
yield an IQ or Intelligence Quotient. In its inception, IQ meant a child’s
Mental Age divided by his Chronological Age multiplied by 100, or IQ =
MA/CA × 100. It still means this for the 1937 Revision of the Binet, the
“standard” individual test for children, and many group tests. Consequently,
you are advised to remember the simple formula IQ = MA/CA × 100. The
manuals of tests that yield IQ’s may be read directly and thus no computation
is necessary when testing. However, pupils’ records may contain an IQ or MA
notation only and you may need to have the other index.
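
The formula IQ = MA/CA × 100 can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and ages are assumed to be expressed in months so that notations such as 9-0 or 10-2 can be handled uniformly.

```python
def to_months(years: int, months: int = 0) -> int:
    """Convert an age written as years-months (e.g. 10-2) into months."""
    return years * 12 + months


def ratio_iq(ma_months: float, ca_months: float) -> float:
    """Ratio IQ: Mental Age divided by Chronological Age, times 100."""
    if ca_months <= 0:
        raise ValueError("chronological age must be positive")
    return ma_months / ca_months * 100


def mental_age_from_iq(iq: float, ca_months: float) -> float:
    """Recover the MA (in months) when a record holds only the IQ and CA."""
    return iq * ca_months / 100


# Harry's MA is 9-0; his ratio IQ depends on his chronological age.
harry_ma = to_months(9, 0)
for years, months in [(8, 0), (9, 0), (10, 2)]:
    ca = to_months(years, months)
    print(f"CA {years}-{months}: IQ = {ratio_iq(harry_ma, ca):.1f}")
```

The second function covers the case mentioned above, in which a pupil's record carries only one index and the other is needed.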

As a scale expression, an IQ always has 100 as its reference point. The
100 signifies the average for any age group, and IQ’s above or below 100
indicate intelligence above or below the average intelligence for the age in
question. Since intelligence, as measured, seems to be distributed in a “normal”
fashion (see pages 163-169 for a discussion of the normal probability
curve), IQ increments may be considered fixed fractions of a Standard Deviation
and thus they approximate the characteristics of scale units. In the standardization
population of the 1937 Revision of the Binet, the size of the
Standard Deviation was 16 IQ points. Group tests habitually are “normalized”
to an IQ distribution with an SD of approximately 16.

Deviation IQ's. Some IQ’s, most notably those derived from the Wechsler-Bellevue,
do not mean a ratio of MA to CA but rather a deviation from a
mean of 100. Because in such tests the Standard Deviation of IQ’s approximates
16 (it is 15 in the Wechsler), the “deviation” IQ and the “ratio” IQ
may be used interchangeably for school-age children. For adults, the ratio
IQ is inapplicable except by the Stanford-Binet convention that all adults
have a chronological age of 16.

The Standard Deviation unit is used directly as the basis for scale scores
yielded by one important intelligence test, the AGCT (Army General Classification
Test, Appendix B). The many of us who were processed by the
Army during World War II were used to AGCT’s of 90, 115, 105, and the
like. Since by test design 100 was the average score of all “literate” trainees,
90 meant below average and 115 and 105 above. The Standard Deviation of
AGCT scores is 20. Thus IQ’s and AGCT’s are not directly comparable even
though they both are scale scores with means of 100.
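
Why two scores with the same mean are not directly comparable can be shown by passing through Standard Deviation units. The sketch below is illustrative only: the function names are hypothetical, the SD of 16 is the Binet figure cited earlier, and the conversion assumes the two norm groups are comparable, which for Army trainees and school children is at best approximate.

```python
def rescale(score: float, from_mean: float, from_sd: float,
            to_mean: float, to_sd: float) -> float:
    """Express a score from one scale at the same SD distance on another."""
    z = (score - from_mean) / from_sd   # distance from the mean in SD units
    return to_mean + z * to_sd


AGCT_MEAN, AGCT_SD = 100, 20   # figures given in the text
IQ_MEAN, IQ_SD = 100, 16       # assumed: SD of the 1937 Binet


def agct_to_iq_scale(agct: float) -> float:
    """Place an AGCT score on a scale with mean 100 and SD 16."""
    return rescale(agct, AGCT_MEAN, AGCT_SD, IQ_MEAN, IQ_SD)


for agct in (90, 105, 115):
    print(f"AGCT {agct} lies as far from the mean as IQ {agct_to_iq_scale(agct):.0f}")
```

Only a score of exactly 100 keeps the same value on both scales; every other score shifts toward the mean when moved from the wider AGCT scale to the narrower IQ scale.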

PERCENTILE RANKS 

A third basic expression for measures of intelligence is percentile rank.
This occurs most notably in the ACE (American Council on Education Psy- 
chological Examination, Appendix B), a college-level test of academic ability. 
The publishers of this test have asserted that IQ's are inappropriate at the 
adult level and that an indication of rank within a known population is a more 



defensible [...]

[...] refer to the same population, they may be related. [...] relationship,
a number of school-age intelligence tests include percentile
equivalents of their IQ’s. This facilitates studies of pupils should their achievement
in subjects be expressed in percentiles. An example of improper
comparisons would be between IQ and percentile rank on a test standardized
on college students. Ordinarily a percentile of 50 would equal an IQ of 100.
In this case, however, a percentile of 50 would more likely equal an IQ of 115
to 120 and a percentile rank of 1 or 5 would more nearly compare with an IQ
of 100.
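
Where IQ's and percentile ranks do refer to the same population, percentile equivalents can be approximated from the normal curve. The sketch below is an illustration under stated assumptions, not a substitute for a test's published tables: it assumes a normal distribution with a mean of 100 and an SD of 16, and the function names are hypothetical.

```python
from statistics import NormalDist

# Assumed model: IQ's normally distributed with mean 100 and SD 16.
IQ_DIST = NormalDist(mu=100, sigma=16)


def iq_to_percentile(iq: float) -> float:
    """Percentile rank of an IQ within the assumed population."""
    return IQ_DIST.cdf(iq) * 100


def percentile_to_iq(percentile: float) -> float:
    """IQ standing at a given percentile rank in the assumed population."""
    if not 0 < percentile < 100:
        raise ValueError("percentile must lie strictly between 0 and 100")
    return IQ_DIST.inv_cdf(percentile / 100)


for iq in (84, 100, 116):
    print(f"IQ {iq}: percentile rank about {iq_to_percentile(iq):.0f}")
```

The conversion holds only within one population. A percentile rank earned on a test standardized on college students cannot be turned into an IQ this way, which is the improper comparison described above.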

UNRELIABILITY OF INDEXES OF INTELLIGENCE

In conclusion, it should be noted that none of these measures of intelligence
is exact, as your age is exact or your weight today. Each is derived
from a test that has some degree of unreliability. Because of test unreliability
each raw test score is unreliable and then, of course, so is the MA, IQ, or
percentile indicated. The unreliability of a test score is expressed as its Standard
Error (see page 169 for discussion of Standard Error), and the Standard
Errors of intelligence indexes are appreciable. One of the most reliable IQ’s,
that derived from the individually administered Stanford-Binet, has a Standard
Error of approximately 4 in the middle of the distribution of IQ’s (41). It
becomes greater for higher IQ’s and less for smaller ones. Thus, if a child
whose Binet IQ was found to be 104 were retested a large number of times,
the probability would be that 68 per cent of the IQ’s found for him would vary
from 100 to 108, 96 per cent from 96 to 112, 99 per cent from 92 to 116,
and so on. But with no certainty could you say at any time that his IQ was 104
exactly. Since other intelligence tests are hardly more reliable than the Binet
and many are known to be less, it is safe to assume a Standard Error of score
of the same or greater size for IQ’s derived from them.
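
The retest ranges quoted above are simply the observed IQ plus and minus one, two, and three Standard Errors. A minimal sketch follows, with hypothetical function names and the assumption that retest results scatter normally around the observed score; under that assumption the three ranges cover roughly 68, 95, and 99.7 per cent of retests, close to the rounded figures given in the text.

```python
from statistics import NormalDist


def retest_bands(observed_iq: float, standard_error: float,
                 multiples=(1, 2, 3)):
    """Ranges of plus or minus k Standard Errors around an observed IQ."""
    return [(observed_iq - k * standard_error,
             observed_iq + k * standard_error) for k in multiples]


def normal_coverage(k: float) -> float:
    """Share of a normal distribution lying within k SD of its mean."""
    z = NormalDist()
    return z.cdf(k) - z.cdf(-k)


# The example from the text: a Binet IQ of 104 with a Standard Error of 4.
for k, (low, high) in zip((1, 2, 3), retest_bands(104, 4)):
    print(f"within {k} SE: {low} to {high} "
          f"(about {normal_coverage(k):.1%} of retests)")
```

For a less reliable test, a larger Standard Error widens every band in proportion.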

The General Nature of Intelligence Tests 

The construction of intelligence tests is not a problem for the teacher or 
school administrator. There are several score published tests that have suf- 
ficient reliability for school use and that, collectively, are pertinent for all ages 
of children, youth, and adults and for the special needs of different types of
schools. In this section we shall not attempt to catalog these tests nor to describe
any particular specimens. Annotated listings of intelligence tests occur
in Appendix B, pages 475-479, along with listings of other standardized tests.
Rather we shall examine their common characteristics so that you may have a
more reasonable basis for selecting a test and for interpreting its scores.





may require

[...] (a) 20 (b) 50 (c) [...] (d) [...]

Aunt (means) (1) grandmother (2) mother's sister
(3) a neighbor woman (4) the same
as sister

The performances may involve instead the recall and application of some general
learnings, as,

Rearrange these words to form a sentence:

left eat dogs dinner are
bones that from

Or the performances may not depend upon any prior learning but simply require
a given adaptive behavior, as,

Repeat these digits just as you hear them 4 3^62 


Find a drawing that is a different view of the object in the first drawing.

(From California Test of Mental Maturity, Intermediate Series, 1946 Revision.
Copyright, California Test Bureau.)

By far the largest number of intelligence test items are of the learned performance
variety. This necessitates the assumption on the part of test-makers
that all who take the test have had equivalent opportunity to learn the required
things. This fact of intelligence tests makes them most appropriate for school-age
children who have had consecutive experience in public schools. It tends
to diminish their validity for some ethnic and socioeconomic groups, a factor
that we shall explore in a moment.

Nonverbal Items. Because verbal items require reading (or good listening
vocabulary in the case of spoken questions in primary tests and in individually
administered tests in some instances), pupils who read well do
better than those who read less well on verbal items. For this reason, most
tests applicable to the elementary grades and some appropriate for the secondary
grades include some nonverbal items. Many have a sufficient number of
nonverbal items to warrant their categorization and separate scoring as such.
Generally speaking, tests devised for older children contain fewer nonverbal 
items. There are certain tests that presume to be strictly nonverbal (see Ap- 
pendix, page 477). It is necessary to remember, in the case of nonverbal items 
or tests, that the directions usually still are verbal and that the responses often 
must be verbal. Two examples of nonverbal, or as they are sometimes called, 
performance items are: 

1. The usual “What part is missing” item, occurring in many primary 
level tests. 





2. The less usual incongruous element item. In this case a picture is shown
that contains something awry and the testee is asked to indicate the incongruous
element. The example shown is from the 1937 Revision of the Stanford-Binet.



(From Revised Stanford-Binet Intelligence Scale, L. M. Terman and
M. A. Merrill, Houghton Mifflin.)





Group Tests. Intelligence tests frequently are classified as group and
individual tests. The former are, without exception, guided response instruments,
and the majority of their items are of the multiple-choice variety. Nearly
all of them are administrable in two hours or less, with many requiring no
more than forty minutes.⁵ Typically, the younger the age for which the test is
pegged, the shorter the test.

Nearly all group tests, except those for primary grades, contain some
arithmetic problems. In an examination of 22 high school and adult tests,
Kenney found a range from 1 to 56 per cent of the items being arithmetic
(27). Naming synonyms, completing series (2-6-18-?), and answering
analogies questions are further typical item types.

Group tests, as the classification implies, are administered to many children
at once and may for this reason have less validity and reliability than do
individual tests. It is intended that group tests be administered by teachers
without special training. The instructions in accompanying manuals and on
the tests usually are clear and comprehensive.

Individual Tests. Individual intelligence tests are fewer in number and,
as a rule, require special training and skill for administration. The most popular
for school-age children and youth, the Stanford-Binet and now the WISC,
or Wechsler Intelligence Scale for Children, are essentially structured interviews
with predetermined tasks and predetermined bases for recording and
rating responses. Some of the items entail guided responses, but the greater
number allow the testee to respond as he will. These free responses require
that the examiner use his judgment as well as a key. For example, in the Binet
one test asks that the child look at a picture and then “tell you about it.” From
his telling you must gauge what credit to allow him. To help you and to standardize
scoring, the test manual contains a detailed analysis schedule for
responses.

Individual tests are considered more valid and more reliable and hence
are used to verify group test findings, to detect mental deficiency, and to
diagnose intellectual strengths and weaknesses. Psychometrists and clinical
psychologists usually learn to administer individual tests as part of their training
and perform this service for schools.

⁵ Administration times for specific tests are in Appendix B.

PROFILES AND DIAGNOSIS

In recent years increased attention has been given to differential aspects of
intelligence. Frequently it has been said with truth that an IQ hides as much
as it shows. A child might do poorly on number items, well on definitions, and
average on digit span to get his IQ of 100. Another might do well on number
items, poorly on digit span, and average on definitions to get the same IQ of
100. Now that some relatively distinct factors (or, as we would name them,
dimensions) of intelligence have been determined statistically and by reasoning,
some test designers have attempted to measure these factors separately
and to furnish separate scores on each such factor. These scores are often
recorded on profile sheets comparable to the one shown in Figure 51.

The factor scores and/or their profile representations furnish information
which single MA’s, IQ’s, percentiles, or Standard Scores do not. They tell
such things as whether a given IQ represents much memory but little adaptivity,
or little memory and much adaptivity. They may suggest the extent to
which test nervousness or anxiety has depressed performance since this condition
affects exact recall and attention-demanding performances the most. And
thus the factor scores or profiles have some diagnostic significance. To know
that a youngster of Greek parentage scores very low on language factors but
fairly high on number factors is to know that probably he is very bright but
needs tutoring in language. To know that a child with good grades in arith- 
metic scores low on reasoning but high on memory is to be able to predict 
that he will likely have more success in business arithmetic than in algebra. 

In using profiles or factor scores, it should be recognized that the part 
measures are less reliable than the total score and that group tests are not 
clinical instruments. But for detailed information, for leads to further measurement,
and for referral purposes they are of far greater value than single
indexes.

RELIABILITY AND VALIDITY OF INTELLIGENCE TESTS

Such precise data as are available for the reliability and validity of specific 
tests are presented in the Appendix along with other brief identifying notes on 
the tests. Here it will be enough for us to consider the matter of reliability and
validity in general.

Using split-half and some comparable-form or test-retest coefficients of
correlation as indexes of reliability, the bulk of published tests for school-age
children have reliabilities comparable to or in excess of the reliabilities of
standardized tests generally. Coefficients of reliability tend to range from .88
to .97. Thus, if any standardized tests can be considered reliable enough for
school use, intelligence tests can. Preschool tests and, in particular, infant
scales are much less reliable. Except for extremely retarded or extremely accelerated
infants, IQ's should not, in the opinion of the authors, be predicted on
the basis of these tests. Somewhat more confidence may be placed in measurements
of three-, four-, and five-year-olds; but even here IQ’s should be asserted
with great reservation.

The validity of intelligence tests has been disputed in recent years. With- 
out detailing the arguments or without examining the great body of research 
that bears on their validity, we may note the bases for assertions of validity 
for tests and two points at which the validity of tests seems to be vulnerable. 

Bases for Claims of Validity. Tests commonly base their claims of validity
on correlations with other tests, on correlations with school marks and
other “external” indicators of intelligence, on analyses of growth curves, and



[Profile chart. Test factors shown: A. Memory; B. Spatial Relationships; C. Logical Reasoning; D. Numerical Reasoning; E. Vocabulary; Total Mental Factors; F. Language Factors; G. Non-Language Factors.]

Figure 51. Profile Page of California Test of Mental Maturity, Intermediate Series,
1946 Revision. Copyright 1936-1951, California Test Bureau.


on internal evidence. The first basis involves giving two intelligence tests to 
the same group and obtaining correlations between the scores. The Stanford- 
Binet is frequently the second test. Sometimes a battery of tests is used instead 
of just a single criterion. Such a process is good, of course, and high correla- 
tions mean high validities as long as we may assume the validity of the 
criterion test. The second basis is that of comparing IQ’s with school marks 
or teachers’ ratings. The correlations between test scores and these variables 
generally are lower than between tests. Yet this basis does involve an external
criterion that seemingly should be related to intelligence. In the third place,
test-makers examine the scores of successively older children and, when a
smooth rise in mean scores is found, they say this indicates validity. The reasoning
involved here is that intelligence “grows” smoothly and if test scores
also “grow” smoothly they must be measuring intelligence. Finally, statisticians
analyze tabulations and graphs of the scores of tests given to persons of
known characteristics and from this they infer validity or the lack of it. If,
for example, the graph of a large unselected population has a “normal” shape,
validity is indicated since intelligence is presumed to be “normally” distributed.
Moreover, validity is to some extent a matter of inspection. The test
says it measures memory. Look at the test and see what items there are that
require recall for correct answering.
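
The first two bases rest on a coefficient of correlation between a new test and some criterion. As a rough sketch of what such a coefficient involves, the following computes a product-moment (Pearson) correlation in plain Python. The function name is hypothetical and the scores are invented purely for illustration; they are not data from any study cited here.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product-moment correlation between two equally long score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length, at least 2 scores")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and of squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("scores must vary on both measures")
    return sxy / sqrt(sxx * syy)


# Hypothetical IQ's of six pupils on a new group test and on a criterion test.
new_test = [92, 101, 110, 97, 118, 105]
criterion = [95, 99, 112, 94, 121, 108]
print(f"validity coefficient: {pearson_r(new_test, criterion):.2f}")
```

A high coefficient supports validity only so far as the criterion itself is valid, which is the caution stated above.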

Verbal Bias. With these ways of appraising validity in mind, let us examine
two aspects of intelligence tests often criticized when validity is discussed.
In the first place, the written tests require from a little to a great deal
of reading and the oral tests require enough oral vocabulary to understand
directions and to say the answers. Unquestionably, those children who have
the greater reading and speaking skill score higher on the intelligence tests
usually employed in schools, and those who have the less verbal skill score
lower because of the difference in verbal skill. Time and again this point has
been probed by researchers and always the generalization is verified. One
recent example of such research is Darcy’s (13), in which he found that 235
children of Puerto Rican parentage had mean IQ's of 79.6 on the Pintner
Verbal Test but mean IQ’s of 87.8 on the Pintner Non-Language Test. Another
example is the comparison made by Bond and Fay (8) of the performance
on individual items of the Stanford-Binet scale of good and poor readers.
When children were equated as to mental age, the investigators found that
good readers performed significantly better on knowledge and word use items;
the poor readers better on nonverbal and memory items. They conclude that
poor readers are underrated and good readers are overrated on the Binet.

One of the authors has administered Binets to many school children of
Mexican parents who spoke their native tongue in the home. It is his impres- 
sion that two out of three of these children had been found with low IQ’s and 
had been considered for mentally retarded classes not for any want of bright- 
ness but for want of opportunity to learn the English language. In particular 
does he remember one boy of eight or so who was unable to identify any of





the pictures of common objects or to define any of the words in the vocabulary
list, yet he seemed to know many of them. Finally, the examiner asked him to
name objects and to define words in Spanish, which he did at a great rate.
But the examiner didn’t speak Spanish so a teacher had to be found to trans- 
late. With her help the boy named and defined to a degree commensurate with 
his age. 

Now it may be contended, with justice, that this verbal bias of intelligence
tests does not necessarily invalidate them. Speaking and reading, writing and
listening, of all human behaviors, perhaps necessitate the most discrimination,
memory, generalization, etc., and hence you should expect the children more
skilled verbally to be the more intelligent. Moreover, the primary use of intelligence
tests is for educational prediction and placement and, assuredly, the
common curriculum is a verbal one. So when the children tested have had
seemingly normal opportunity for learning the English language or when
their scores on nonverbal elements are equivalent to those on verbal elements,
an IQ derived from ordinary tests is meaningful. But, when these conditions
do not obtain, we advise that you view such IQ's with suspicion if by IQ you
mean a measure of intelligence as defined here.

Culture Bias. Intimately related to the language bias of tests is what is
now called their culture bias. Davis, Havighurst, Eells, and others (16, 17,
19, 25), principally of the University of Chicago, have for many years been
contending that intelligence tests discriminate against children of lower class
socioeconomic status. Essentially, they argue that American published intelligence
tests have middle-class content. Lower class pupils have little or no
access to this content and thus have restricted opportunity to learn what they
must have learned if they are to succeed on the tests. To examine this possibility
of cultural bias, let us consider the following items selected from a widely
used group intelligence test:⁶

In three sections of the test, proper answering requires the proper identification
of small pictures portraying many different objects, persons, and
activities of American life. Among these are such things as a chandelier, a
meat grinder, skis, a knapsack, a radio, an Egyptian statuette, a wolf and a
lamb, a dripolator and percolator, a western saddle, ice tongs, and, it is necessary
to confess, a number of objects with which the authors have had no
previous experience and hence which they cannot identify. In the vocabulary
section occur entreat, facetious, disparage, and presage, to say nothing of
vertigo and quondam.

So far as the objects are concerned, it is open to question whether children
twelve to fifteen in age, the group for which the test is designed, raised in
migratory labor camps in California, in a tenement in Chicago, or on a rented 
farm in Arkansas would have had direct or even indirect experience with all 
these objects. As to the words, it is thought that only families of considerable 


⁶ The test is thought to have neither more nor less cultural bias than tests in general.





education and having interest in serious reading would ever use the words 
or even own books containing them. Yet, the words represent ideas of no 
great complexity. Lower class and even many middle-class people simply use
different expressions for them, i.e.,

for entreat, beg            for presage, predict
 “  facetious, funny         “  vertigo, dizzy
 “  disparage, belittle      “  quondam, former

It seems evident to us that intelligence tests do have a cultural bias and
that the bias is toward that of the culture of teachers. As predictors of success
in school, however, this should not invalidate the tests. For, as we know, the
culture of teachers largely constitutes the culture of the schools and to be or
to become proficient in this culture is to make high marks. The fact of bias
does necessitate care in interpreting test scores. It requires that decisions about
children of lower class environments be made only after a careful analysis of
test performance and, if needed, retesting. Some tests, and portions of most
tests, are relatively culture-free. Davis and Eells have published one, which
they say has scarcely any cultural bias (15).

IQ VARIABILITY AS AMONG TESTS

We have mentioned that any intelligence test has some degree of unreliability
and hence, on repeated application of the same test, it should be anticipated
that a pupil’s IQ will vary by several IQ points. Here we need to
observe that different tests administered to the same pupil are likely to yield
different IQ’s. Test designers use different standardization populations. The
shapes of the distributions of scores vary from test to test. The standard deviations,
basic units of variability for distributions of scores, are greater for some
tests than others. Each test has a different proportion of mathematics items,
vocabulary items, nonverbal items, etc. For these and perhaps other reasons
it has been impossible for test designers to produce tests that are mutually
comparable for all their range despite the serious efforts of most test designers
to do just that.

As a rule IQ's from one test are most comparable to IQ's from another
test at the middle of the distribution or, let us say, from IQ 90 to IQ 110. As
extremely low or extremely high IQ's are encountered the IQ's often become
somewhat dissimilar. The size of these variations is exemplified by Lennon's
study of three tests applied to 1,200 high school students (29). He compared
the Terman-McNemar Test of Mental Ability with the Otis Quick-Scoring
Mental Ability Tests and the Pintner General Ability Tests. Among his findings
were that Pintner IQ's were 2 to 5 points lower than Terman-McNemar's
and that the Otis had a smaller SD than did the Terman-McNemar. For this
reason an IQ of 66 on the Terman-McNemar meant an IQ of 76 on the Otis,
and an IQ of 131 on the Otis meant one of 141 on the Terman-McNemar.
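
The effect of a smaller standard deviation can be illustrated with the two pairs of equivalent IQ's just quoted. The sketch below assumes, purely for illustration, that the two scales are related by a straight line through those two pairs; the study itself is not claimed to have used this method, and the function names are hypothetical.

```python
# Illustrative sketch only: a straight line through the two quoted pairs
# of equivalent IQ's. The linear form is an assumption, not Lennon's method.

def fit_line(p1, p2):
    """Return (slope, intercept) of the line through two (x, y) points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return slope, intercept

# Pairs are (Terman-McNemar IQ, Otis IQ), as quoted in the text.
LOW_PAIR = (66, 76)
HIGH_PAIR = (141, 131)
SLOPE, INTERCEPT = fit_line(LOW_PAIR, HIGH_PAIR)

def terman_mcnemar_to_otis(tm_iq):
    """Convert a Terman-McNemar IQ to the Otis scale under the linear assumption."""
    return SLOPE * tm_iq + INTERCEPT

def agreement_point():
    """IQ at which both scales give the same number (solves x = SLOPE * x + INTERCEPT)."""
    return INTERCEPT / (1 - SLOPE)

if __name__ == "__main__":
    for tm_iq in (66, 90, 110, 141):
        print(tm_iq, round(terman_mcnemar_to_otis(tm_iq), 1))
    print("scales agree near IQ", round(agreement_point(), 1))
```

A slope below 1 reflects the smaller spread of the Otis, and under this assumption the two scales agree near the middle of the range and drift apart toward the extremes, which is the pattern described above.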





Whether a given test's IQ tends to run high or low, and how much, is
often noted in critical reviews of tests. These occur in many professional
journals (Educational and Psychological Measurement, for one) and most
notably in Buros' Mental Measurements Yearbooks. In addition, the school's
or district's psychometrist and/or psychologist should be able to give advice
about the IQ's of different tests. Where judgments about mental development
over a period of time are to be made, the variability among different tests is
particularly critical. If possible, these judgments should be based on
readministrations of the same test or its comparable forms.

SOURCES OF INFORMATION ABOUT INTELLIGENCE TESTS

A number of intelligence tests and several publishers of them are cited in
Appendix B, pages 475-479. Additional bibliographic sources of information
are Buros' Mental Measurements Yearbook (9), the Encyclopedia of Educational
Research (30), professional journals (psychological as well as educational),
and publishers' catalogues. The Yearbook generally is considered the
source. Certainly, in it tests are most likely to be viewed critically and it is
the only source intending to be exhaustive. Publishers' catalogues provide
the most complete statements of price, length, scoring, etc. Some provide
reliability indexes but they tend to stress the wide applicability of any test rather
than its limitations. The manual of a given test usually describes the way in
which it was standardized and provides evidence of reliability and validity.

Administration and Scoring of Group Intelligence Tests 

Practices in the administration and scoring of group intelligence tests vary
from school district to school district. In some, a special person, usually
designated as a psychometrist or psychologist, administers and scores group
tests as well as individual tests. In others regular teachers (or in high schools,
counselors or orientation teachers) administer and score the group tests. In
a few cases a teacher administers and a specialist scores or vice versa. Districts
having access to machine scoring devices typically use machine answer sheets,
where usable, and the tests are scored by a clerk or technician.

Precise directions for administering and scoring tests are contained in test
manuals. The general principles of test administration were presented in
Chapter 6, pages 118-121, and are applicable to intelligence tests. When and
if you administer an intelligence test, you are asked to be extremely attentive
to the directions and to adhere closely to principles of good test administration.
No other single test result is likely to have such an important bearing on how
a child will be handled in school.

Good test administration, you may recall, is more than a mechanical
matter. It involves the way children feel about the test situation and what
you do to make them feel right. As you know, the technical term for this
feeling matter is rapport. Unless rapport is adequate, don't administer the
test. If Betty seems frightened and you can't coax her out of it, have her take
it another day. If children appear to rush because a passing bell is coming
soon and it isn't a timed test, invalidate that portion of the test and retest them
on another day if necessary. Should unusual distractions occur during the
testing period, make note of them and judge test results in the light of them.
You may wish to discount them. In short, you are enjoined to administer an
intelligence test only in the best manner and only under optimum conditions.

Evaluation of Intelligence 

Few evaluative standards exist for intelligence and, for the most part,
evaluation is not involved in appraisals of intelligence. Since intelligence is
presumed to be a factor that bears on school achievement and not a matter of
achievement, teachers do not mark or grade a child in it as they do in arithmetic
or public speaking. When a school psychologist measures a pupil's MA
as part of his effort to determine how best to handle him, fact finding, not
judgment, is involved. Judgment or evaluation, as we define it, entered when
a teacher referred the pupil because of misbehavior or poor achievement or
presumed neurotic symptoms. And later the student's adjustment will be
evaluated to determine if he should return to this teacher's class. But the fact
of a low or a high MA is simply a diagnostic reality, not a matter for judgment.

Mental Deficiency. The evaluations expressed regarding intelligence and
what standards there are usually have to do with mental deficiency, genius, and
academic or vocational placement. Many states have laws and many school
districts have regulations as to what constitutes mental deficiency. The aspects
of these laws or regulations that concern teachers are standards of educability.
Typically, these are very general statements about the child's ability to profit
from instruction, to learn a trade, to be self-reliant, etc. After individual testing
the examiner judges whether or not the child should be kept in the regular
school situation or given the special treatment provided for those who fall
below the legal standard. By custom, perhaps also for good reason, an IQ
of 65-70 tends to be the rule-of-thumb line between those judged mentally
deficient and those judged normal but dull. Standards (in terms of IQ's) for
the further extremities of mental deficiency, those that require institutional
confinement or some degree of permanent custodianship, are of little significance
to regular teachers since children so extremely deficient rarely reach the
schools.

Gifted Children. At the other end of the intelligence spectrum lie those
who are called "gifted" or "precocious" or "geniuses." In recent years much
less effort than heretofore has been given to precise classifications of genius.
Genius, as you are aware, is simply a word we use for those whose achievement
we greatly admire. It has no precise signification unless we wish to give
it one. Terman, in his studies of "genius" (42), used a specified IQ, 140, as
a basis for admission to his group of gifted children. However, lower IQ levels
have been designated for genius by other investigators, 130, 125, even 120;
and, on the other hand, it has been asserted that the term should be reserved
for those of IQ 180 and above (30:505). Thus, if in a school you wish to find
your "gifted" children, you will have also to find a standard for them.

Subject Placement. Vague as intelligence standards are for mental
deficiency, they are even more vague for subject and vocational placement.
From many studies (30:878-883) a positive correlation has been established
between intelligence and achievement. Yet the precise degree of mentality
(expressed as an IQ, percentile, etc.) essential to success in any subject or
vocation is largely undetermined and it may be undeterminable. It is customary
to restrict study of higher mathematics, laboratory science, and foreign
language in the high school to those of "better than average" intelligence,
but neither research nor practice warrants designation of precise minimum MA
or IQ levels to these or any other subjects. In general, it may be said only
that as subjects are more verbal and abstract, a higher level of intelligence is
required, and, as they are less verbal and abstract, children of less intelligence
have more chance of success.

Vocational Placement. Similarly, in vocational or professional training,
positive correlations have been found between success and intelligence in
many areas (30:886-890). No minimum levels of intelligence have been
established and we may be sure only of the general principle that the more
complex, abstract, and self-directed vocations require the more intelligence.
In applying this principle, schools and industries frequently adopt exact
intelligence levels as cutting points in selecting trainees. They do this arbitrarily
and must defend their designated level on the basis of "it works for us."

Measures of Intelligence and Educational Practice 

As we have studied the several aspects of measuring intelligence (dimensions,
forms, procedures and, just now, standards) we necessarily have suggested
many applications of intelligence measures to educational practice.
Now, to conclude the chapter, we wish to state some cautions to be observed
in the use of IQ's and MA's and then to specify certain school uses for measures
of intelligence.

SOME CAUTIONS ABOUT THE USE OF IQ's AND MA's

Heredity and Environment. To begin, the intelligence measured by a test
is a function of both heredity and environment. A child's parentage and
ancestry does, through genetic transmission, affect his IQ. But his preschool and
school experiences also do affect his IQ. Thus a social worker may not with
justice declare that Bill's case is hopeless because he is born to illiterate and
dull parents. On the other hand, he may not expect to change Bill's intelligence
radically by placing him in a good foster home. Special investigations of twins
raised in different homes, of foster children-foster parent relationship, of racial
and national differences in intelligence, and of famous and infamous families
all indicate that environment and heredity contribute to intelligence in
unknown degrees (23, 31, 32, 36, 37).





IQ Variability. Secondly, it may be observed that the IQ is not fixed.
We have seen that any test's inherent unreliability makes it likely that retesting
will produce different scores. We have observed, moreover, that the IQ's
from different tests are likely to vary because of differences in items and in
shapes of standardization distributions. In addition, IQ's are not fixed because
test performance is a function of the pupil's health and attitude as well as his
intelligence. As these improve or deteriorate so does his test score. They are
not fixed because specific instruction on the content and method of the tests
(not cribbing, of course) may improve test performance (5). And they are
not fixed because intelligence itself may "grow" at a different rate from that
presumed by the test designer and hence the IQ's may vary. It is impossible
to predict just how much IQ variation is likely to be encountered although
researchers have tried to find out (18). As a rough guide, we suggest that you
consider variation of up to five points as usual, up to eight points as nothing
to get excited about, but over eight points as a matter that deserves specific
explanation.
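
The rough guide just stated can be written out as a small routine. This is only a sketch of the rule of thumb; the function name and the wording of the returned labels are ours and carry no technical standing.

```python
# Sketch of the rough guide above; the name and labels are hypothetical.

def classify_iq_change(first_iq, second_iq):
    """Label the difference between two IQ's found for the same pupil."""
    difference = abs(second_iq - first_iq)
    if difference <= 5:
        return "usual"
    if difference <= 8:
        return "nothing to get excited about"
    return "deserves specific explanation"

if __name__ == "__main__":
    for pair in ((100, 104), (100, 93), (110, 120)):
        print(pair, classify_iq_change(*pair))
```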

Designed for School Children. A third point of caution is that intelligence
quotients and mental ages are designed for school-age children who are
attending school and should be so used. We have seen (page 372) that preschool
and particularly infant tests yield IQ's that are inefficient predictors
of school-age IQ's. For adults the concepts of mental age and of IQ as a ratio
are clearly inappropriate. And intelligence tests presume a common continuous
experience on the part of those tested. Children not in school are
those least likely to have this common continuous experience.
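
To see why the ratio breaks down for adults, recall that the ratio IQ is 100 times mental age divided by chronological age. The sketch below uses that formula with ages in months. The adult case assumes, for illustration only, a mental age that has leveled off at 15 years; that figure and the function name are hypothetical.

```python
# Ratio IQ = 100 * MA / CA, with both ages expressed in months.
# The adult example assumes a mental age that has stopped rising at 15 years;
# that value is hypothetical and chosen only to show the arithmetic.

def ratio_iq(mental_age_months, chronological_age_months):
    """Return the ratio IQ for a mental age and chronological age in months."""
    return 100 * mental_age_months / chronological_age_months

if __name__ == "__main__":
    # A child of 8 years (96 months) with a mental age of 10 years (120 months).
    print(ratio_iq(120, 96))
    # The same formula applied to an adult of 30 with a leveled-off mental age.
    print(ratio_iq(15 * 12, 30 * 12))
```

Since chronological age keeps rising while measured mental age does not, the ratio for an adult falls year by year without any change in ability, which is why the ratio is unsuitable beyond the school years.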

Relation to School Marks. Then, you have heard many times that IQ's
bear a correlation of about .50 to school marks. Actually, there is less relation
with some subjects, the performance ones, and more with others, the
predominantly verbal ones. Each study finds a different correlation figure
and each test seems to show a different figure, but for the elementary and high
school grades the correlations tend to cluster around .50. Now it is important
to know that this .50 does not mean 50 per cent. It does not mean that half
of the variation in marks among students is to be attributed to variations
in intelligence. If it connotes any percentage, it should connote 25 per cent,
this being the percentage of variation that a coefficient of correlation of .50
indicates is explained by the correlated variable. (See page 180 for further
explanation of this.)
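
The arithmetic behind the 25 per cent is simply the square of the correlation coefficient. A minimal sketch follows; the function name is ours.

```python
# The proportion of variation explained is the square of the correlation r.

def proportion_explained(r):
    """Return the proportion of variance associated with a correlation r."""
    return r * r

if __name__ == "__main__":
    for r in (0.30, 0.50, 0.70):
        print(r, round(100 * proportion_explained(r)), "per cent")
```

Note that even a correlation of .70 accounts for only about half of the variation in marks.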

Parental Interpretation of an IQ. As a fifth caution, we need to be reminded
that parents tend to think that an IQ means a given grade of intelligence
and that is all there is to it. A relatively small percentage may understand
it as the teacher understands it. The great remainder will not. These will
not treat an IQ as they would an index of height or weight. Rather, they will
worry about it or wish to brag about it to their neighbors. They will view
their child's IQ as a proper measure of a tangible mental substance, intelligence,
and the measure, whether low or high, will have great significance
for their egos. This is why we maintain the rule: parents are not to be told
their children's IQ's.

Individual and Group Prediction. Finally, in the way of cautions, let us
differentiate between the use of IQ's for group and for individual prediction.
In Chapter 8 we discussed the matter of probability in behavioral measurement
and we saw, among other things, that the precision of group measures
was related to the size of groups measured. Thus an average for a small group
is less precise than an average for a group of 100. If you will consider that
one pupil is the smallest possible group, then you can see readily that the IQ
found for one pupil is more of an approximation than is the average IQ found
for 100 pupils. Therefore, since precision in measurement affects accuracy
of prediction, we must consider it far safer to make predictions about groups,
having measured their intelligence, than to predict about individuals. For
example, you may say with assurance that 100 children having IQ's of 110
or more will as a group do better in algebra than another 100 having IQ's of
109 and less. But you should have much less confidence that John with an
IQ of 115 will make a higher grade in algebra than Tom with an IQ of 105.
The odds are that he will, but it's far from a safe bet.
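
How precision grows with group size can be sketched with the usual standard error of a mean, which shrinks with the square root of the number measured. The sketch assumes independent errors of measurement, and the error of 5 IQ points for a single pupil is a hypothetical figure chosen only for illustration.

```python
import math

# Standard error of a group average: error SD divided by the square root of n.
# The single-pupil error of 5 IQ points is hypothetical, and errors are
# assumed independent from pupil to pupil.

def standard_error_of_mean(error_sd, group_size):
    """Return the standard error of the average of group_size measurements."""
    return error_sd / math.sqrt(group_size)

if __name__ == "__main__":
    for n in (1, 25, 100):
        print(n, standard_error_of_mean(5.0, n))
```

Under these assumptions the average for 100 pupils is ten times as precise as the IQ found for one pupil.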

THE USES OF INTELLIGENCE TESTS

IQ's and MA's, intelligence percentiles, and standard scores are useful
in many ways to teachers and school administrators. They should be used only
with full understanding of their significance and fallibility and with such cautions
as we have described. However, they should be used. Simply to administer
an intelligence test or to know a pupil's score accomplishes nothing
in itself. What matters are the things that may be done on the basis of intelligence
testing that could not be done as efficiently without it. Among the ways
in which the scores of intelligence tests may be used are the following:

(a) In determining reading readiness. 

(b) In determining whether or not any pupil's achievement approximates
his potential.

(c) In determining eligibility for subjects or courses requiring a high
degree of intelligence.

(d) In establishing sections of a grade or a course differentiated according
to ability.

(e) In vocational and educational guidance in secondary schools. 

(f) In selection of reading materials for given classes. 

(g) In assigning pupils to special classes or curriculums.

Summary 

Intelligence is a construct to explain differences among individuals as to
intellectual aspects of behavior. Its dimensions vary from theorist to theorist
and from test to test. In this context the dimensions are considered to be
recall, discrimination, symbolization, abstraction, invention, and adaptivity.
Differences in intelligence are expressed primarily by Intelligence Quotients
(IQ) and Mental Ages (MA), but Percentile Rank and Standard Scores are
in use as well. The last is the measure preferred by the Armed Services and
percentile rank is employed by the most widely used college-level intelligence
test.

Numerous tests administrable to groups are available for every school
grade and for adults. The Revised Stanford-Binet Scale and the Wechsler-Bellevue
Scales are the most prominent individual tests. Tests for preschool
children necessarily require individual administration and are less reliable
than school-age measures. Precise evaluative standards do not exist for intelligence.
Typically, evaluation is undertaken only in determining mental deficiency
and genius and in academic and vocational selection.

Measurements of intelligence are invaluable for educational practice, but
their use necessitates observance of certain cautions. The IQ reflects both
heredity and experience and is not a fixed quantity. It is reasonably constant
and the more so during the elementary grades. Since the IQ and MA indexes
were designed for school-age children and for children attending school, they
should be so used. The correlation between school marks and IQ is about
.50, higher r's being found for abstract verbal subjects and lower r's for performance
and activity subjects. Parents will tend to regard the IQ with exaggerated
significance. Finally, predictions based on intelligence testing are more
accurate for groups than for individuals.


EXERCISES 

1. Inspect a group intelligence test and indicate the following:

a. Items that have a cultural bias.

b. Items that place a premium on language skill.

c. The dimensions of intelligence that the test seems to measure.

2. Study the manuals of several tests and make a written comparison of the
tests according to:

a. Clarity of directions.

b. Adequacy of standardization.

c. Reliability.

d. Coverage of the dimensions of intelligence discussed in this chapter (see
page 365).

3. Compute the missing index in each of the following. Consider that they relate
to the Stanford Revision of the Binet.

a. CA = 12, MA = 11, IQ = ?

b. CA = 9, IQ = 110, MA = ?

c. CA = 18, MA = 15-6, IQ = ?





4. Define the important differences in administration and use between a group
test of intelligence and an individual one.

5. Administer a group test to a class, score the papers, and compute Mental
Ages and IQ's for each pupil tested.

6. Explain the difference between an IQ and a Percentile Rank as indexes of
the intelligence of adults. Between Mental Age and Percentile Rank as applied to
school-age children.

7. Discuss the different educational uses of measures of intelligence, giving
attention to the validity and reliability of intelligence tests.



CHAPTER 15 


PERSONALITY AND CHARACTER 


While by various names personality and/or character has been a concern
of teachers for thousands of years, for most of these years there has been
little scientific knowledge of their nature, their development, or their measurement.
In the last several decades, however, techniques of psychological and
sociological research have been applied to the phenomena and currently there
is a considerable body of demonstrable knowledge about them. Enough children
now have been studied in enough different ways to permit the charting
of trends and patterns in personality and/or character development. And,
necessarily, the measurement of personality and character has kept pace with
the scientific study of their nature and growth.

In the schools, increased knowledge has been followed by increased specific
attention to the development of personality and character. Moreover,
curriculums, methods, and reading material are being designed in conformance
with the realities of developing child personality.

In regard to measurement of personality, however, the advances of psychologists
and sociologists have had less immediate consequence for teachers.
Evaluation of personality and character variables has no doubt become a
more frequent activity of teachers, particularly elementary teachers, but the
evaluations still are gross and the traditional methods of observation and
impressionistic rating still are the ones most frequently used. While Buros'
Fourth Mental Measurements Yearbook lists some 121 standardized instruments
for personality and character measurement, only 17 of these would be
of use to teachers.

Apparently, there are a number of reasons for this condition. Most of the
tests and other devices for group administration are not sufficiently reliable
to yield any but the most tentative information about individuals. On the
other hand, the individually administrable instruments that may be used with
more confidence require special training for use and their administration
usually demands far more time than teachers have available. A third deterrent
to accurate personality-character measurement by teachers is the helter-skelter
state of definition for the phenomena and their dimensions. Finally,
schoolmen and the public have yet to decide just how much concern the
school should have for evaluation of personality and character, with the result
that teachers may more easily dispense with it than with the evaluation
of subject achievement.




Therefore, like the previous one, this chapter is strictly limited in scope.
How-to-do-it treatment is given only to those dimensions that many teachers
at all grade levels normally evaluate and only to such procedures for measuring
them as may safely be used without special training or knowledge. These
are thought to be

1. The measurement of attitudes and interests that have educational im- 
plications, by means of self-reports and tests of opinion, and 

2. The evaluation of certain character attributes or citizenship through 
observation and analysis of peer opinion. 

Other phases of personality or character measurement will be described
merely for purposes of orientation to the general processes and problems of
the field. The technical training in clinical and case study techniques needed
by school psychologists, psychiatrists, psychometrists, social workers, and
guidance officials is beyond the scope of this volume.^


GENERAL CONSIDERATIONS 

Definition of Personality and Character 

The dimensions of personality are for the most part covert and inferred,
for personality itself is an abstraction. Like intelligence, the term is a construct,
the name for whatever pattern of consistency may be observed in a
person's behavior and, like intelligence, what comprises it or what we call its
dimensions is the distinguishing feature of this consistency. For example, we
may say that Tom has a bad temper and Joe has an easy one. What we may
mean is, first, that Tom behaves consistently in one way and Joe in another
and, second, that Tom's consistency has frequent angry outbursts as one of
its components and Joe's does not have such a component.

Character similarly is a symbol for observed consistency in behavior and 
in many contexts it is used interchangeably with personality. Generally, 
though, it refers to such behavioral consistency as has an important bearing 
on a person's moral or ethical reputation while personality usually is a more 
inclusive term and implies no particular social evaluation. For example, a 
“personality” would not be called good or bad but only adjusted, neurotic, 
rigid, flexible, and the like; whereas a “character” would be evaluated as good 
or bad, strong or weak. 

^ Normally, these specialists undertake advanced study in behavioral measurement,
using such texts as Gladys L. and Harold H. Anderson, Introduction to Projective
Techniques (New York: Prentice-Hall, Inc., 1951); J. E. Bell, Projective Techniques (New
York: Longmans, Green & Co., 1948); R. B. Cattell, Description and Measurement of
Personality (Yonkers: World Book Co., 1946); L. J. Cronbach, Essentials of Psychological
Testing (New York: Harper & Bros., 1949); David Rapaport and others, Diagnostic
Psychological Testing (Chicago: Year Book Publishers, 1946, 2 vols.); and, of course,
the manuals of protocol for important projective tests.





Dimensions of Personality and Character 

Dimensions of both personality and character commonly the focus of
tests, questionnaires, and rating devices are outlined in Table 31 along with
the forms and procedures of measurement and the evaluative standards appropriate
to them. The listing is somewhat arbitrary since there is little agreed-upon
definition for personality and character or consensus as to their dimensions.

DIMENSIONS ARE INFERRED

As we have observed, all the dimensions are matters of inference rather
than observation. For example, we can observe a child jumping away from a
snake, a youth using a road map, and an adult saving money to send his children
to college. But we must infer the fear that made him jump, the belief
that the map is correct that allowed the youth to follow its directions, and the
high value of education to the father, which prompted his saving. Moreover,
the fears, beliefs, and values that we talk about and try to measure usually
are inferred from many incidents, not just one, and hence are those things
we call abstractions. Finally, because the children we have observed exhibit
certain behaviors from which inferences as to fears, beliefs, and values are
appropriate, we assume that children we have not observed have "fears, beliefs,
and values" in some manner and to some degree.

The measurability of such inferred and abstract dimensions is discussed in
Chapter 2, pages 26-28. It is stated there that clear definition of inferred
dimensions is essential and, because they themselves are not observable, that
directly related observable dimensions must be found for them. It is on these
two knots of definition and related observables that personality measurement
frequently is fouled.

FEARS, BELIEFS, AND VALUES

Since attitudes, interests, and character attributes (citizenship) will be
discussed by themselves, we may confine ourselves here to fears, beliefs,
values, and certain other variables of personality structure. Fears refer, of
course, to the items for which the pupil would have fright reactions and the
degree to which he would be frightened. Beliefs seem to denote those statements,
ideas, concepts, etc., that the individual feels sure about or will act
upon. That to which values refer is far more complicated, both psychologically
and semantically, but for purposes of measurement it may be resolved into
the individual's basic preferences as to goals and means. A simple instance
of this is afforded by an adolescent boy as against his adult male teacher in
the case of a classmate of the boy cheating on a test. The boy probably would
prefer to cover for his friend whereas the teacher would prefer to discover and
punish the misdeed. We would say that the boy's primary value in the situation
was "loyalty to his friends" whereas the teacher's value was "to maintain
discipline."

Plainly then, fears, beliefs, and values have some common subdimensions:
the identity and number of items to which feelings attach, the valence or direction
of the feelings, and the intensity of the feelings. So, in measuring fears,
beliefs, and values, psychologists must ascertain what pupils fear, believe in,
and value; whether their feelings are to approach or to avoid and just how the
approach or avoidance is expressed; and, finally, just how fearful a pupil is,
how strong are his convictions, and how intense are his values.

PERSONALITY STRUCTURE

Personality structure or pattern often is an object of measurement for
psychiatrists and clinical psychologists. In Table 31 are shown three different
types of variable commonly used in describing personality structure.

Polar Dimensions. The first type consists of the designation of a number
of polarities with which each personality necessarily is involved: dominance-submissiveness,
placidity-excitability, and masculinity-femininity, for example.
It is assumed that personalities are ranged from one end to the other on each
of the polarized continuums and that personalities may be characterized by
stating where they are found on each continuum.

Personality Types. A second approach to describing a personality is to
state its type and how clearly it is that type. The "types" illustrated in Table
31, cycloid, paranoid, compulsive, are derived from diagnostic classifications
for neurotic and psychotic states. Others (sanguine, morose, dependent, aggressive)
are simply the way people have become accustomed to "typing" one
another. The idea in the establishment and use of types is simply to notice
that there seem to be groupings of individuals so far as personality is concerned
and to give a convenient name to each group. Usually the basis for
the grouping and for the name is a single prominent characteristic that seems
to pervade and color all activities and aspects of the person. Cycloid individuals
are either up or down, have extreme emotional swings; sanguine are
always hopeful and seeing the bright side; dependent are always clinging to
a superior or appealing to authority; etc.

Personality Traits. Each of the two approaches to personality characterization
just described deals with the personality as a whole, either classifying
it or placing it in respect to some polarized behavioral abstraction. The
third type illustrated attempts, on the other hand, to identify the prominent
elements of the personality and to indicate their relative strength and interrelationship.
These elements may be the very familiar ones we have indicated
as separate dimensions in Table 31 (Attitudes, Interests, Fears, Beliefs,
Values) or they may be particular to some psychological viewpoint or personality
test. Those cited in the table are used with a given projective test
but are consistent with current theories of personality motivation. Need domi-



I ?. 


c 

<n 


UJ 


1-^ 


O O) 


^ -5 


5 G 


g ^ i> 

00 5 3 


5 c 


•0^2 
C C 3 

§ s 2 

p -g ^ 

::i i 

I « i 

D ^ Ui 

— ^ <D 

-TJ <U 

^ -a 

° 5 5 

<L> ^ 

e»L % w 

o ^ «3 
i- <u b 
cij •a ✓ 


^ a 

(T -- 

iS c 

o 2 

i-g 

D. « 

:: c 
g o 


Ui 


«» 

O 

^ O 


c /3 

C/J 'H- 

C <£ 
O OJj 

- o 
& S 




O o 

w Cu 


^ u E 

<L» O ^ 'C 
•3 £ a> 3 


- W 


E 

o. 

o 


:: 

% 3 

t'5 

CL 

i o 

UJ ^ 

E - 


tc 


■s : 

(A I 
C« 

(U 

Ui 

a 

/< 

(i> 


T3 

C 

eo 


C 

O O 


^ 1 


CL u 
tr 

^ O 

■=* £ o. 


bO 

c 


UJ c« 

1 4) 

^ 1^ 

OQ z*' 

H 

o 


x; 


00 w 

- 

J o t- 

4> ^ O 

^ u '*-' 

(u o £ 

£ 73 


I 0 = *1 

’13 O ^ o' 

> « T* 

I- o/j j 

3^ E 

t: X 

O C a E 

^ O Sj 


gfa 

s- s 

I 6 • 

£ O 


X o 
3 


<y Vlli 

e c 

£ C F 

B 3 
t/' cr £ 

- o 


o tS 

E ^ -> 

^ o 

eg e 

3 E g £ 

o c UJ 


a t: 


00 


V 2 


■g 2 

^ -3 


3 C 

wT J 

or - O 

w 73 Q o 
£ ^ 3 

O o ^ 

^ H E 
o C r* 

X g - y 

j C -. o 


•s g 


= -^ s 

£ or 

- <i> c 


CL 0> 3 

O ^ ^ O 

O 

tJj ^ CO 

C 

O / c ^ 

p r 3 U 

E P r* 


3 

O 

P 

o 

E 

o. 

, o . 


E 

~ flj 


'3 7 

U TJ 


o o 


•g 

3 

O 


2 o 


" 2 
o 


S d 

3 "3 


£ I r<; 


3 

c 


o 

o 

X 


390 



T\BLr ^1 (Cominued) 



391 



TABLE 31 iCoiitmind) 


1 g- s 


e - 3 'S § 
2 _C2 'O S 
6 o ^ S 

z; 3 s2 

o. ^ = O 73 

® ^ S S 3) 

v5 

O O 13 . O 


a. 

2 


0 

t/' 

C 




bu 

a> 

73 

OJ 

X 

0 

B 


0 


0 

3 


■d 

3 

73 

V) 

73 

«i-4 

CJ 

If 

«u 




c 



CD 

<3 


5 

0 

3 

& 

73 

0 

w 

E 

0 

-3 

0 

C 

X 

0 

J2, 




73 

a> 

_ 

I***! 

Cm 

ao 



CD 


<ii 

0 

«L» 

73 

cd 

0 

0 

1 -^ 





1 = 3 

e al 

"ti 

^ 5 c 

T3 o c 


H « 5 

O CD 3 


0-1 

-2 a< CD 
^ o 3 

=3 y £ 

O a> CD 


S ^ P 

43 io 3 


O <1^ 

2 *= 
3 a> 


u 'i> ti 
C c 


E G y 

Q -TJ O 

Q a. i 


-73 ^ 

"2 o 3 

o c ^ 

— : 71 p 

::i. 2 o H 
u u w 


e in y 
■3 OJ 

O ^ tp 

73 «J 

73 73 73 

^ 2 O 

Z. Z. Z, U4 


392 


predominant personaliiy stereotype 
for a gi\en group 




6 = 

B w 
? 2 
2 ^ 
B 

c 2 
-a c: 

ri 

o.^ 

e S 


1-3 


O 

Q, 


Ja .5 


J=i ca 

“ S 

<Li 

°l 

in O 
« 


^ <U 

c, ^ 


=3 u; 
Xi C 3 

5-5 


— 

^ C 
X 3 


c 

o 

'in 

"o 

»-* 73 

S " 

*0 2: 


^ 2 X 

ESI 

n> (y 

“ s § 

E c c*< 

^-3 S 

O 

X 73 

^ E 


CO X' “ •.- 


O 

a 


U 73 -C 


^-1 C tr ^ 

: 2 „ I 

- 3 " = 

U bO (L) 

Q 7 s e 


^ Wj 

X o c^ 


w 

a. 

i B 

r! <1^ 
^ -E 


;3 

o. 

c 

(4 


^ g 

c z: 


o ^ 
“ c 


33 =ilj . 
^ 2 


o ^ 

E ^ 


(L» y 

CL 

I 

X -2 X 




§ e 


'3 


B ^ 

CO C 

w A 
c ^ 


•Tj a- 
X 

X c/i 
^ t- 
O o 
73 

tJ ^ 


O 

X o 

33 X 


o 'L> 


c c 

1^ o 2 

o X ^ 

7 «y 
X 


■s ^ • 

Q, ‘ 

>. o 
^ - 

"2 '3 

CO C 


X 

u 


OJ «3J 
Cl. h: 

'JJ n 

ca c 


iX c 

s 

^ yj- 

S o ii 
• E o. ^ 

ZJ v" 

1 .» , 

•73 tij X 
C ♦_ 

— X CO 

. X a, 

§ J E 

X X 

-1-^ (O 


E X 
c c 

o 2 


fc 

-5 


^ 3 


73 

X 

cO 


3 

X 


X 

o 

73 

5 


O 

c 3 


O) L 

5 3 


3 X 
X CJ 
3 O 


X L> — 


C 

o 


X c 

E r> 


^ 73 


7 < 


393 


generally are less \alKl and reliable thin procedure'- of measuring achievement or intelligence Reliability and validitv must be estimated for 
a procedure at time of use and used in the Imht of tl at Miniate 

\gc cxpLctaniit ^ an I rurms foi »iicsc dm r siun** contained in such books as Havichurst, .7 uvan D(\f lopment and Education Gesell 
and llg. Child Dexelupnuni Olson ( lulJ D^\i lopnn nt Jenkins, Shacter, and Bauer, 7 //ot ir^ ) our C fiilJren 



394 


CUSTOMARY USES OF MEASUREMENT AND EVALUATION 


nance l^|rs to a presumed motivation on the part of an individual to domi- 
nate ommrs. His need might be weak or strong and he would be assigned a 
number indicating its strength. The needs that predominate in a person to- 
gether with their intensities serve as the structure for a personality according 
to this view. 

Forms and Means of Measurement of Personality-Character 

Of all behavioral phenomena personality and character are most restricted
to the least precise of measurement forms, description-classification.
Extended verbal statements are about the only valid means those who engage
professionally in the measurement of these phenomena have to symbolize
their status. Classification always is possible but so many are the dimensions
of personality and character that when you have correctly classified the pupil
in one way you probably have wrongly classified him in several other ways.

Classification as well as description is, of course, applicable to specific
dimensions of personality and character. Children and youth can be assigned
numbers or words that stand for a given cluster of interests, beliefs, or values;
and arbitrary classifications can be contrived for most of the other
dimensions. For example, attitudes toward school might be classified in terms
of the age level where they are typical and a child’s attitude toward school
expressed then as juvenile, adolescent, or adult.

Ranking also is an appropriate form of measurement for those specific
personality-character dimensions that are unitary in nature. Children could,
with validity, be placed in rank order relative to their interest in athletics,
their fear of animals, their honesty, and even their popularity. They cannot
and should not be ranked on such mosaic dimensions as attitudes in general,
personality structure, or social adjustment.

Scale measures may not be used for personality-character or any of their 
dimensions, measuring instruments with the essential character of true scales 
having yet to be devised for them. 

The great variety of measuring procedures applicable to personality and
character are cited in Table 31. As might be expected, the entire repertoire
of behavioral measuring devices is used: observation, product analysis, free
response, and guided response; and several techniques are employed that are
peculiar to this area only: self-report, peer rating, interviews, projective
tests, etc. In ensuing paragraphs the six basic approaches to
personality-character measurement shown in the table will be explained
briefly. Those usable by teachers in classroom situations will be discussed at
greater length in connection with the dimensions to which they are most
applicable.

OBSERVATION 

The rationale behind observation as a means of personality or character
appraisal is simple and obvious. The two, as we have stated, are no more
than constructs that serve to explain consistent differences in behavior among
individuals. To observe behavior, then, is inherently the most valid means of
measuring either personality or character.

Techniques of observation are applicable to all the dimensions cited in
Table 31, but particularly so to character attributes. In fact, appraisals of
character by means other than observation are held in suspicion by many
educators and psychologists. The evaluations that teachers report for any of
the dimensions usually are based on observation. Psychologists and
psychiatrists employ special observational devices in analyses of personality
structure and diagnoses of personality disorders. Children often are placed in
controlled play situations and then observed directly or indirectly, through
one-way vision screens. The detailed behavior profiles of infants and small
children prepared by Arnold Gesell and now used as norms for child
development were based on repeated observations of hundreds of children in a
special observation chamber.

Observation may be casual and of “real life” situations or it may be
formalized and directed toward contrived situations. In either event, it is
necessary to translate what is observed into a permanent record of properly
meaningful words or numbers. Recording or rating devices frequently used in
school situations are illustrated in Figure 59, and a full discussion of
observational rating and recording is given in Chapter IV, pages 52-59.

Though observational procedures have great potential validity, much of
this is lost in practice. It takes many samples of behavior to yield accurate
inferences about the interests, values, or adjustment they reflect and, too often,
only one or two samples are used. Moreover, each observer brings his own
hopes and apprehensions into the situation and to a degree observes them or
their effects rather than the actual behavior of the child being observed.
Finally, if the record or rating yielded by observation is read by another, he
may impute widely different meanings to the descriptions and classifications
from those intended.

PRODUCT ANALYSIS

A second way of measuring some personality dimensions is to analyze the
writings, drawings, sculpture, etc., of a person. Since pupil products differ
even when they deal with the same material, it may be assumed that some of
these differences derive from personality-character differences in the pupils.
It remains then only to identify the “personal” aspects of the products and
the personality dimensions they represent.

Product analysis is employed chiefly by clinical psychologists for the
diagnosis of personality disorders; paintings and drawings are the principal
products used. In addition, the imaginative writings of older children, youths,
and adults sometimes are studied for clues to certain psychotic states. While
handwriting long has been reputed to “reveal” the personality and while there
are reputable persons who have had some success in handwriting analysis, it
remains largely the province of the charlatan.




In illustration of drawing-personality relationships consider the case of
an elementary grade pupil whose drawings over a two-year period were
entirely of trees. Always the trees were bent and blown as if in a storm and
always one tree was broken or split. Suddenly, the storm motif was dropped,
a sun appeared in the sky in some pictures and, in climax, a drawing was
made in which a split tree had been mended with cement. This was followed
by an occasional drawing of things other than trees and finally the boy again
had the repertoire of subjects for drawing common for his age.

The boy’s history during this period involved the following. His mother
had died and his father had grieved excessively but passively. He had not
permitted the boy to discuss the dead mother nor to display his own grief and
he not only had not assumed the mother’s role with the boy but, in his sorrow,
had withdrawn even some of his previous fatherly attention. At last, the father
remarried and the new mother quickly established a warm relationship with
the boy.

Product analysis generally is not a device for teacher use. The
interpretation of drawings, writings, and the like requires a more extensive
study of both the media and personality structure than is included in the
average teacher education program. Moreover, even for specialists the procedure
presents many problems. The element thought to have personality significance
may actually derive from something else: a limited supply of colors, a
too small drawing surface, an affectation of his social group unknown to the
examiner, even ineptness in a phase of the production that forces elimination
or diminution of some factor. The latter is exemplified in the drawing of
hands. Hands are hard to draw and children may omit them for this reason.
Yet, according to some diagnostic protocols, the absence of hands or other
body members has important significance for a child’s sex concepts and
tensions. There is need usually for a number of products to exhibit the same
characteristic before any conclusion may be drawn, and always it is
advisable to seek from another source further support for any conclusion.

SELF-REPORT

Both of the two previous procedures seek to measure personality and
character dimensions as they normally are manifested. A third means is to
ask the person himself to report these manifestations, to describe his own
personality and character or the past events, actions, and feelings that might
relate to them. This widely used technique is employed by teachers to gain
information about interests, by counselors and school psychologists to assess
fears, beliefs, and problems, and by psychoanalysts to diagnose subconscious
tensions. The many ways of eliciting self-reports are listed in Table 31 and
certain guided response devices appropriate thereto are illustrated in Figure
57.

Self-reporting is as widely criticized as it is used. The procedures are
eschewed by experimental psychologists and any devotee of behaviorism
simply on the grounds that they involve introspection. Others condemn
self-reporting on two rather obvious and reasonable points.

1. Memory is involved and a person’s memory about his emotional states
and his feeling-toned experiences is apt to be considerably distorted.

2. Personality, character, and all their dimensions are subject to social
stereotypes of acceptance and rejection. Children and particularly youths
and adults are aware of these stereotypes and hardly any are willing to report
themselves in an unfavorable light.

So self-reporting is apt to produce an inaccurate measure of all
personality-character dimensions and to provide an unduly favorable appraisal
of any that society evaluates. It is simply that no one is likely to remember
just how many bad dreams he has had nor how bad they were and very few of us
will admit to acts of dishonesty and cruelty or to feelings of hatred or
avarice.

For these reasons, self-reporting, whether by questionnaire, autobiography,
or interview, is most useful for dimensions for which society has no strong
and distinct acceptance/rejection stereotypes. This leaves the procedure
applicable to attitudes and interests; to fears, beliefs, and values, as long
as such areas of strong social feeling as sex and religion are avoided; and to
nonevaluated aspects of personality structure, placidity-excitability, for
example. It prevents much use of self-reporting for measurement of character
and of socially evaluated aspects of personality structure, i.e.,
masculinity-femininity.

GUIDED RESPONSE TESTS OF OPINION AND PREFERENCE

A fourth means of personality-character measurement is even more of
an indirect procedure than self-reporting. This is the very familiar guided
response test of attitude, interest, personal adjustment, etc. As their
rationale, guided response tests of personality or character seem to argue
that choices expressed among verbal options in a test situation are indicative
of the feelings and concepts that normally guide a person's behavior. The
three types of item predominant in such tests are illustrated in Figure 58.

Criticisms very similar to those directed toward self-reports are voiced
against the tests. It is far from demonstrable that a pupil’s behavior during
a test has a necessary and clear-cut relationship to given personality-character
variables. Pupils may attempt to choose the response most acceptable socially,
even refuting their actual feelings to do so. Having no grade to earn, pupils
may “kid” the test or mark it in a random fashion. And even if test
performance is determined by personality (as it certainly is to some extent)
and is undertaken honestly and conscientiously, there is no certitude that a
response means what it is supposed to mean.

A simple example of this point of invalidity is afforded by the first item
in Figure 58. “Yes No 1. Teachers are more strict than other people.”
This item might appear in a test of attitudes toward school. If it did, it
probably would be keyed to a negative attitude toward school, perhaps to the
specific idea that if the pupil says “Yes” he is willing to impute an unpleasant
thing to the teacher and thus has at least one ingredient of a negative attitude
toward school. On the other hand, one pupil might say “Yes” because his
experience with teachers actually had found them more strict than others.
Another pupil might say “No” because of opposite experience. Yet each might
have similar attitudes toward school or the first could have the more positive
and the other the more negative.

Usually, the designers of personality tests are well aware of these
weaknesses in their procedure, and they use various means to offset them. In
validation studies, it is demonstrated that test behavior and other behavior
are related. Items and scoring keys are incorporated to detect dishonesty and
insincerity. Many items are keyed to given dimensions, not just one. Thus
through increased sampling the effect of irrational determiners of responses is
minimized. But despite these and other precautions of test designers, it is
advisable to weigh the validity of a guided response procedure very carefully
when it is applied to personality or character dimensions.

Because of suspect validity, guided response instruments generally are
used only for group appraisal and for preliminary screening of individuals.
For example, before and after instruction in literature it would be appropriate
to measure the attitudes of pupils toward certain types of reading, but only
to see what mean gain there was and not to mark any single pupil’s attitude.
Or, a group personality test might be administered to truants to detect which
of them warranted further individual testing in regard to possible psychotic
tendencies.

PROJECTIVE TECHNIQUES

The free response tests used in personality measurement typically are
called projective tests or projective techniques. The reasoning behind their
use and the pretext for their name is the assumption that an individual will
project his own feelings and viewpoints into an ambiguous situation. If the
situation is clear and meaningful then the likelihood is that the individual will
attempt responses that he thinks will cope with the situation: a true-false test,
a questionnaire, a game of cards, being scolded by the teacher are examples.
On the other hand, if the stimulations themselves are meaningless, vague,
unstructured, as an ink blot, a blank sheet of paper, an indistinct picture, or
a ball of clay, and the individual is forced to react to the situation, he must
give it meaning and this meaning necessarily will reflect his thoughts and
feelings. (See Figure 8 for illustrative pictures, page 71.)

The boy's paintings referred to on page 396 were “projections” of his
feelings. An action by an eight-year-old boy observed by one of the authors
several years ago provides another vivid illustration of projected feelings. The
boy had been a reverse truant, he wouldn’t go home from school, and he
was being interviewed in an effort to determine the cause. After some
conversation about himself and things he liked and disliked, he was given a
piece of paper and a pencil and asked to draw something. He countered with the
usual “What do you want me to draw?” and was told, “Just anything.” After a
minute of sitting and no attempt at drawing, it was suggested that he draw
his family. He drew some stick figures of various sizes in front of a stereotyped
house and then stopped. When asked, he identified each of the figures, father,
siblings, self, but none was identified as mother. “Where is your mother?”
the boy was asked. He went to work again and drew a reclining stick figure
below the plane of the others and then scribbled many horizontal lines through
the figure. “There’s mother,” he volunteered, “there's mother, way down in
the mud.”

In reality, of course, projection is a matter of more or less rather than
presence or absence. All behaviors must reflect personality to some extent,
if there is any significance in the construct, and no stimulation field is likely
to be found with absolutely no inherent significance. Projective measurement
involves the use of stimulations that contain as few clues as possible to an
expected or appropriate response and the interpretation of responses to see
what was projected rather than what was reacted to.

Several projective procedures have been standardized and are used
extensively by psychological clinicians. The much publicized Rorschach (23),
the Thematic Apperception Test (23), and the Make a Picture Story (27)
are among them. These standardized tests and other informal projective
techniques usually are employed to diagnose personality disorders. They result
in descriptive data only or the most gross classifications and thus have little
pertinence for group measurement. Expensive both in time and in the
specialized training required to use them properly, the techniques usually are
not appropriate for classroom use.

PEER OPINION

Since Moreno published Who Shall Survive?² the “disposition” of the
group has been studied as avidly as the “disposition” of the individual by
many sociologists and psychologists. Out of their study has come a new area
of measurement, sociometry. It is not within the scope of this text to deal
extensively either with the dimensions of group structure and dynamics or
with the techniques for appraising these dimensions.³ However, two of the
many sociometric procedures are means to assessing social aspects of the
individual's personality or character and they are amenable to use by teachers.
Moreover, one of these, the sociogram, can provide quick information about
the who-likes-whom structure of a class and such information often is of
importance to teachers and guidance officials.

Underlying the use of peer opinion to measure personality or character
seems to be the following rationale: a pupil’s feelings, ideas, mannerisms, etc.,
are more likely to be expressed to his classmates than to his teacher. Thus
peers are in the better position to observe each other’s “real” personality.
Since children and youth are avidly interested in one another, they do observe
one another carefully and critically. They may not be as skilled in unbiased
observation and in its translation into words as teachers are, but this lack
can be compensated for by proper sociometric techniques. In several studies
of leadership and adjustment, peer opinion has been found to be more
prognostic of success than any other single factor and always it is positively
related to the other indexes.

² Nervous and Mental Disease Monograph Series, No. 58, 1934.
³ For further information on group dynamics and its measurement see issues of
the journal, Sociometry, and Dorwin Cartwright and Alvin Zander (eds.), Group
Dynamics, Research and Theory (Evanston: Row, Peterson and Co., 1953).

Peer Ratings. A frequently used but still novel technique for classifying
pupils according to gross personality or character stereotypes is the “Guess
Who” test pioneered by Hartshorne and May many years ago (16). As
illustrated in Figure 52, the gist of the procedure is to give brief
descriptions of certain types of children or youth and then to ask pupils to
name which of their classmates fit each one of the descriptions. In devising
the descriptions it is essential to phrase them in pupil language, to make each
stereotype distinct from any other, and to have no contradictions within any
stereotype. Guess Who tests may be used to classify pupils into rough
categories of social adjustment, manners, work habits, honesty, etc.

Here are descriptions of several girls in your class. Can you guess who
they are? Write in the name of the girl you think each one is.

__________ She has to have her own way or will break up the game
and is very bossy and selfish.

__________ She is so shy that she seems to be afraid to talk, would never
contradict anyone, and gets nervous when anyone pays her some attention.

__________ She is always helping other pupils, never tries to get credit
for things herself, talks very well when she is with her friends, but not so
well in a large group.

Figure 52. Illustrative Guess Who items used to measure the reputation pupils
have with their peers.

Pupils can rate each other directly, of course, just as teachers rate pupils,
using appropriate graphic or other kinds of rating scales (see Figure 59).
Somewhat finer classifications can be made by direct rating and more
dimensions can be covered in a given instrument, but a Guess Who procedure is
likely to be the more reliable.
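
How Guess Who ballots might be tallied can be sketched briefly. The sketch
assumes each pupil's ballot is recorded as a mapping from a short label for
the description to the classmate named; the labels, the names, and the
function tally_nominations are hypothetical and are not drawn from any
published form of the test.

```python
from collections import Counter, defaultdict

def tally_nominations(ballots):
    """Count, for each description, how often each classmate was named."""
    tallies = defaultdict(Counter)
    for ballot in ballots:
        for description, name in ballot.items():
            if name:  # a pupil may leave a blank unfilled
                tallies[description][name] += 1
    return tallies

# Hypothetical ballots from three pupils; an empty string is a blank left open.
ballots = [
    {"bossy": "Ann", "shy": "Ruth", "helpful": "Jean"},
    {"bossy": "Ann", "shy": "Ruth", "helpful": "Ruth"},
    {"bossy": "Jean", "shy": "", "helpful": "Jean"},
]

tallies = tally_nominations(ballots)
for description, counts in tallies.items():
    print(description, counts.most_common())
```

The tallies yield only rough classifications: a pupil named often under one
description may be placed in that category, while a scattering of single
nominations supports no conclusion.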

The usefulness of either direct rating or a Guess Who test by pupils is
limited by several considerations. Many pupils will be hesitant to rate any of
their peers negatively, some will overrate their friends and condemn their
enemies, others may feel that it is not their business to evaluate one another,
and each will be likely to discuss how he rated the others; hence it is necessary
to have excellent rapport with a class before attempting to use the devices.
Confidential treatment of data must be pledged to the pupils and they in turn
must be enjoined to keep mum on how they marked one another.

Sociograms. The making of a sociogram is illustrated in Figure 53. In
the procedure pupils are asked to indicate their friendship or admiration
preferences (dislike, acquaintanceship, or any other basis for discrimination
may be used). From these expressions, tabulations may be made for each pupil
and a diagram may be drawn to show the pattern of attraction, admiration,
etc., within the group.


Pupils indicate how they like one another by answering questions such as this:

“Put down the name of the boy or girl you would most like to work with
on a project.”

Their responses may be tabulated on a matrix (chart) like this:

[Matrix with a row and a column for each pupil (1. Tom, 2. Betty, 3. Bill,
4. Helen, 5. Mary, 6. Elsie, etc.) and a Total row; cell entries illegible in
this copy. The accompanying sociogram diagram is not reproduced.]

Figure 53. Simplified illustration of the compilation of a sociogram.
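
The tabulation step can be sketched as follows. The pupil names are those
that head the chart in Figure 53, but the choices themselves are invented
for illustration, since the entries of the printed chart are not available;
the function name choice_matrix is likewise an assumption. Each row shows
whom a pupil chose and each column total shows how many times a pupil was
chosen.

```python
def choice_matrix(pupils, choices):
    """Build a chooser-by-chosen matrix and the column totals (choices received)."""
    index = {name: i for i, name in enumerate(pupils)}
    matrix = [[0] * len(pupils) for _ in pupils]
    for chooser, chosen in choices.items():
        matrix[index[chooser]][index[chosen]] = 1
    totals = [sum(row[j] for row in matrix) for j in range(len(pupils))]
    return matrix, totals

pupils = ["Tom", "Betty", "Bill", "Helen", "Mary", "Elsie"]

# Hypothetical answers to "most like to work with on a project".
choices = {
    "Tom": "Bill",
    "Betty": "Helen",
    "Bill": "Tom",
    "Helen": "Betty",
    "Mary": "Betty",
    "Elsie": "Mary",
}

matrix, totals = choice_matrix(pupils, choices)
for name, row in zip(pupils, matrix):
    print(f"{name:<6}", row)
print("Total ", totals)
```

A sociogram is the same information drawn as a diagram, with an arrow from
each chooser to the pupil chosen.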

From either the tabulation or the sociogram may be derived indexes of
popularity, acceptance, etc., for each pupil. The raw tabulations can be
converted into rank or classification symbols. In addition, information about
the class as a whole may be obtained from the sociogram. Figure 54 shows the
many possible individual and group measures a sociogram may yield.



Figure 54. Illustrative sociogram of friendship patterns.


Popularity scores could be determined for each of the pupils by counting
first choices 2 and second choices 1. With such a system, pupil 17 would
have a score of 10 and his popularity rank would be 1, whereas pupils 7 and
14 would have scores of 0 and their popularity ranks would be 22.5. Or the
pupils might be classified according to the number of first choices they
received, as

4 firsts    Pupil 17
3 firsts    Pupil 8
2 firsts    Pupils 5, 18
1 first     Pupils 1, 2, 4, 6, 11, 12, 15, 19, 20, 22, 23
0 firsts    Pupils 3, 7, 9, 10, 13, 14, 16, 21
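
The scoring and ranking just described can be sketched in a few lines. The
sketch uses a hypothetical class of five pupils and not the class of Figure
54, and the function names are assumptions. First choices count 2 and second
choices 1; pupils with equal scores share the mean of the rank positions they
occupy, which is how two pupils tied at the bottom of a class of 23 come to
hold rank 22.5.

```python
def popularity_scores(pupils, first, second):
    """Weight each first choice received as 2 and each second choice as 1."""
    scores = {p: 0 for p in pupils}
    for chosen in first.values():
        scores[chosen] += 2
    for chosen in second.values():
        scores[chosen] += 1
    return scores

def average_ranks(scores):
    """Rank 1 is the highest score; tied pupils share the mean of their positions."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j to the last pupil holding the same score as pupil i.
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        shared = ((i + 1) + (j + 1)) / 2
        for pupil in ordered[i:j + 1]:
            ranks[pupil] = shared
        i = j + 1
    return ranks

# Hypothetical class: each key is the chooser, each value the pupil chosen.
pupils = ["A", "B", "C", "D", "E"]
first = {"A": "C", "B": "C", "C": "A", "D": "C", "E": "A"}
second = {"A": "B", "B": "A", "C": "B", "D": "A", "E": "C"}

scores = popularity_scores(pupils, first, second)
ranks = average_ranks(scores)
for p in pupils:
    print(p, scores[p], ranks[p])
```

Here pupils D and E receive no choices, so they share positions 4 and 5 and
each is given rank 4.5.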

Definite characterizations could be assigned to certain pupils on the basis
of the sociogram. Pupil 17 is a leader of both sexes. Pupil 8 is a girl leader
and 18 is a boy leader. Pupil 14 is an isolate who apparently has no feelings
of friendship for any of his peers. Pupil 7 is similarly isolated but does
indicate a desire for friendship with a boy and a girl.

In addition, the illustrative sociogram might support the following
descriptive statements about the class as a whole. The group is organized
around pupil 17; the organization cuts across sex lines and is fairly
pervasive. There is one girl clique, pupils 1, 2, and 5, and a male mutual
pair. The presence of only one true isolate and the wide distribution of first
and second choices indicate a friendly class that probably has been together
for a long time.
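
Characterizations of this kind can be read off the choice data mechanically,
as the following sketch suggests. The choices shown are hypothetical and are
not a transcription of Figure 54; the function names are assumptions. An
unchosen pupil is one whom no classmate names, and a mutual pair is two
pupils who name each other.

```python
def unchosen(pupils, choices):
    """Pupils whom no classmate chose."""
    chosen = {c for picks in choices.values() for c in picks}
    return sorted(p for p in pupils if p not in chosen)

def mutual_pairs(choices):
    """Pairs of pupils who chose each other."""
    pairs = set()
    for a, picks in choices.items():
        for b in picks:
            if a in choices.get(b, ()):
                pairs.add(tuple(sorted((a, b))))
    return sorted(pairs)

# Hypothetical first and second choices, keyed by pupil number.
pupils = [1, 2, 5, 7, 14]
choices = {1: [2, 5], 2: [1, 5], 5: [1, 2], 7: [1, 5], 14: []}

isolates = unchosen(pupils, choices)
# Among the unchosen, separate those who also made no choices themselves.
made_no_choice = [p for p in isolates if not choices.get(p)]

print("Unchosen:", isolates)
print("Unchosen and made no choice:", made_no_choice)
print("Mutual pairs:", mutual_pairs(choices))
```

The distinction between the two kinds of unchosen pupil parallels the one
drawn in the text between a pupil who names no one and a pupil who names
others but is named by none.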

Individual measures derived from sociograms have the same application
as measures derived from other procedures of personality-character appraisal.
Obviously, they are particularly useful for such dimensions as popularity,
leadership, and friendliness and they can be indicative of social adjustment or
maladjustment. Group data have pertinence for group control, seating,
committee formation, and guidance activities.

As with peer ratings, the sociogram is a confidential document and pupils
will give frank answers only if they trust the teacher. Questions on which the
charts are based must always relate to a specific activity or situation and not
to feelings in general. While the latter may sometimes be inferred from the
responses, the use of specifics minimizes the anxiety that pupils may have
about their social status in general.

Because the expressions of children and youth about one another are often
capricious and are based too much on recent events and because their feelings
for one another are subject to change without notice, only the most tentative
conclusions may be derived from sociograms. Always other supporting
evidence should be required before action is taken. Never should pupils be
given “therapeutic” counseling, seating plans be changed, or committees be
composed solely on the basis of a sociogram.

PROCEDURES FOR MEASURING PERSONALITY-CHARACTER SUMMARIZED

To summarize this section on procedures, there seem to be six basic ways
of measuring the personality or character we impute to individuals because of
the characteristic consistency we observe in their behavior. We may look at
and listen to what they do (observation). We may study what they have made
(product analysis). We may ask them what has happened in their past and
how they have felt (self-report). We may determine how they think and feel
now (tests of opinion). We may see how they structure an unstructured
situation (projective tests). And, finally, we may ascertain their reputation or
popularity among their peers (peer opinion). As a rule, projective tests and
product analysis require special knowledge and technical skill beyond that
possessed by most teachers.

Evaluative Standards for Personality-Character 

Few things are more subject to evaluation than a child’s personality or
character. He makes and loses friends, wins and displeases teachers, dominates
or submits to groups according to the value others place on them. While they
usually are not marked in the same sense that subject achievement is marked,
many report cards have space for teachers to check their satisfaction or
dissatisfaction with the pupils’ citizenship (Figure 55). And if they are not
graded per se, evaluations of personality and character often affect the
subject marks given by many teachers.

On the other hand, precisely defined evaluative standards are almost totally
lacking for the two. Those that are used, as specified in Table 31, tend to be
vague, subjective, based on philosophy rather than practicality, and otherwise
tend to violate the conditions of valid standards (see pages 193-194). Often
evaluations are made without reference to any known standard. A teacher
or counselor simply translates his impression of the pupil directly into an
evaluative symbol.

Standards in use seem to fall into five principal categories: norms,
idealized conceptions, characteristics of criterion groups, laws, and clinical
definitions. The first, norms, probably are the basis for most judgments.

PERSONAL AND SOCIAL GROWTH

Listed below are desirable traits which the school and home seek to develop
in children. A check indicates satisfactory growth in the traits listed below.

Contributes to group planning and thinking ____
Plans work well ____
Completes work begun ____
Works co-operatively with others ____
Listens well ____
Uses time to good advantage ____
Is courteous and considerate ____
Takes active part in Physical Education ____
Shows good sportsmanship ____
Practices good health habits ____
Observes safety rules ____
Respects property of others ____

Figure 55. Example of a form to report on citizenship items. (From a report
card used in Sacramento County, California, Elementary Schools.)


NORMS 

Norms may be the attitudes found in sixth-grade boys or the average 
sportsmanship of secondary pupils. Their use involves the assumption that a 
group mean or one of its extremes epitomizes the status of most value and 
lesser values accrue to deviations from this point. The most valid instance of
norms as evaluative standards may be in the evaluation of social development
where usual sequences of development and the range of behaviors usual at
any age constitute the norms.

IDEALIZED CONCEPTIONS 

The second category of standard, idealized conceptions, is to the educator
what pure gold is to the assayer. To the assayer, value increases as the metal
contains more gold and less of other elements. To the educator, the value 
of the pupil's character increases as it contains more of the traits described 
in the idealized conception and less of other, usually antithetical, traits. An 
idealized conception of personality or character or any of their dimensions is 
a description of a condition of maximum value. The description is the “pure” 



PERSONALITY AND CHARACTER 


405 


gold and as pupils seem to approach it they are given high evaluations. An 
example might be 

Co-operation: Always thinks of the group ahead of himself; works as hard on 
tasks others initiate as on his own; volunteers his talents as needed but withholds 
them when they are not needed; can change from leading to following according to 
the demands of the task without ego involvement. Etc. 

CRITERION GROUPS

The characteristics of criterion groups constitute a third important type of
evaluative standard for these phenomena. The usual process in this type of
evaluation is to select a group who epitomize an extremity of value for the
phenomenon in question and then to record the characteristics of the criterion
group so that any individual may be compared with them. An example of the
use of criterion groups may be seen in the evaluation of vocational interests.
Successful practitioners are selected in several important occupations or
occupational areas and the responses of these persons are obtained to many
questions of opinion. An individual is considered to have “an interest” in the
occupation in question to the degree that his responses to the same or similar
questions of opinion are like the responses of the criterion group.

Standardized interest tests demonstrate a further convention in the use of 
criterion groups as evaluative standards, namely, the incorporation of a stand- 
ard into the measuring instrument so that the test score is immediately an 
evaluative symbol. First, a number of items are prepared that discriminate 
between the criterion group and people in general. Then a test is prepared and 
administered to the criterion group. The item by item responses and/or the 
total scores of criterion individuals are recorded. From these a scoring key or 
score-interpreting device is prepared that automatically shows how any testee
compares with the criterion group. As his item responses or score approximates
the item responses or score of the criterion group he is assumed to have an 
interest in the occupation it represents. 
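
The keying procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the response codes ("L" for like, "D" for dislike), the function names, and the rule of keying each item to the criterion group's most common response are hypothetical choices, not the method of any published interest test.

```python
from collections import Counter

def build_key(criterion_responses):
    """Key each item to the response most common in the criterion group.

    criterion_responses: one list of item responses per criterion member.
    (Assumed keying rule; published tests use more elaborate weighting.)
    """
    n_items = len(criterion_responses[0])
    key = []
    for i in range(n_items):
        counts = Counter(person[i] for person in criterion_responses)
        key.append(counts.most_common(1)[0][0])
    return key

def interest_score(testee, key):
    """Count the items on which the testee answers as the criterion group does."""
    return sum(1 for answer, keyed in zip(testee, key) if answer == keyed)

# Hypothetical responses of three successful practitioners to four items.
criterion = [
    ["L", "D", "L", "L"],
    ["L", "D", "D", "L"],
    ["L", "L", "L", "L"],
]
key = build_key(criterion)
score = interest_score(["L", "D", "L", "D"], key)
print(key, score)
```

The higher the count of agreements, the more nearly the testee's responses approximate those of the criterion group, which is the sense in which the score is "immediately an evaluative symbol."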

The standards of idealized conceptions and norms often are similarly
incorporated into measuring devices so that evaluations may be made or read
automatically rather than as a separate step. The rating “scales” used for 
citizenship or character express evaluations rather than measurements only 
(see Figure 59) whenever their numbers or intervals stand for range in a 
norm group or for degrees of approximation of an idealized conception. 

LAWS 

The use of laws as evaluative standards for personality and character 
dimensions occurs primarily in connection with determinations of insanity, 
immoral conduct, and professional incompetence. Both statutes and common 
law restrict the activity and authority of “insane” persons and many statutes 
define penalties for certain types of behavior. These laws are the standard of
value for courts in judgments about insanity and immorality. In much the 





same fashion, states, municipalities, and occupational associations have
described conditions under which licenses may be revoked, tenure covenants
broken, and civil service employees dismissed (see Figure 56). The wording
of these laws and regulations as to competency, honesty, loyalty, moral
lassitude, and the like is the standard that courts and commissions apply to
the cases presented to them.

ARTICLE 2. DISMISSAL OF PERMANENT EMPLOYEES
(Education Code, State of California)

13521. No permanent employee shall be dismissed except for one or more
of the following causes:

(a) Immoral or unprofessional conduct.

(b) Commission, aiding, or advocating the commission of criminal
syndicalism, as prohibited by Chapter 188, Statute of 1919, or in any
amendment thereof.

(c) Dishonesty.

(d) Incompetency.

(e) Evident unfitness for service.

(f) Physical or mental condition unfitting him to instruct or associate
with children.

(g) Persistent violation of, or refusal to obey the school laws of the
State or reasonable regulations prescribed for the government of
the public schools by the State Board of Education or by the
governing board of the school district employing him.

(h) Conviction of a felony or of any crime involving moral turpitude.
Etc.

Figure 56. Example of a legal evaluative standard: the dismissal clause in the
California teacher tenure law.

CLINICAL DEFINITIONS

Clinical definitions, the final type of evaluative standard for
personality-character to be discussed, are the standby of psychologists and
psychiatrists. Through experience and research they have established many
categories of maladjustment, neurosis, and psychosis. They have found that
certain expectancies as to duration and development belong to each of the
categories, that given types of treatment are more satisfactory for some than
others, and that stereotypes of background and etiology tend to accompany each.

The psychological and psychiatric professions have written in their texts
and case books definitions of these many states.* After interviews and tests,
the examiner may evaluate his patient’s behavioral ailment by naming it
schizophrenia, paranoia, hysteria, neurasthenia, or some such. This means
simply that the patient’s condition has been judged most like that definition
bearing the name used. (The definition may be subjective and based on the
psychopathologist’s experience rather than writings.)

* The definitions are not precise and they shift from year to year and from
writer to writer.

Of the five types of standard described for use in personality-character 
evaluation, only norms and idealized conceptions are likely to be of use to 
teachers in their evaluations of pupil behavior. To ascertain and to prepare for 
use the characteristics of a criterion group is so time-consuming and technical 
a task as to make the criterion group useful only in standardized testing.
Teachers usually do not make legal judgments; and clinical definitions, of 
course, should not be used by any but graduate psychologists and psychiatrists. 

Making Evaluations of Personality or Character 

In basing evaluations of personality-character dimensions on norms and
idealized conceptions, it is especially necessary to pay strict attention to
principles of valid evaluation (see pages 193-205). The norms and the
conceptions are expressed verbally and inevitably they involve the bias of the
evaluator. There must be safeguards against the errors that derive from either
of these conditions and other aspects of the evaluation must be indefectible so
that error will not be compounded.

No research is available to indicate the proper way to mark
personality-character dimensions nor does the collective experience of teachers offer any
consistent precedents. From the general tenets of good marking, however, we 
may derive a few guides for practice. 

1. If citizenship or any personality-character variables are to be graded,
mark them separately and as such. Do not allow them to be a component in
any subject or skill mark.

2. Mark the dimensions of character or citizenship separately. Assign an
evaluative symbol or symbols to each of the factors you weigh: neatness,
punctuality, honesty, regard for others, etc. Do not mark character or
citizenship as a whole.

3. Use verbal evaluative statements rather than letter marks where possible.

4. There is no special merit in any particular set of evaluative symbols
(A, B, C, D, F; S, U; S, U, O; E, G, F, P, etc.) when they are applied
to dimensions of personality or character. It is easier, of course, to use a
dichotomous or trichotomous set rather than five because there are fewer lines
where close decisions are involved. The nature of variation in
personality-character dimensions, however, may be better suited to five-valued
judgments than to two or three.


ATTITUDES AND INTERESTS 

A pupil’s attitudes and interests are significant to teachers in two ways. 
First, they affect what and how efficiently he learns and, second, changes in 





them often are a specific objective of instruction. It is easily demonstrable 
that children tend to remember things they like and to forget what they dis- 
like, that boredom makes instruction ineffective, that a pupil’s interest in 
fishing, say, can be exploited in a reading lesson. Moreover, the purpose of 
schooling is not only to teach facts and skills but also to teach appropriate 
attitudes and interests. 

The terms “attitudes” and “interests” seem to refer to the relatively mild
feelings stimulated by the things that flood in and out of a person's experience.
In the most prevalent usage the word “attitude” more often is reserved for
feelings about things external to the person: school, war, minority groups,
political parties, etc. Interest, on the other hand, usually denotes a feeling
attached to an activity in which the person may engage: reading, baseball,
bridge, painting. An additional difference between the two is that attitudes
may range from negative through neutral to a positive valence while interests
may range only from a neutral to a positive feeling. Of course, the idea of
aversion can be and often is placed as an opposite pole to interest and then the
variation is the same as for attitude. Obviously, the two are so allied in
meaning that they may be used interchangeably in many situations. “What is your
interest in school?” and “What is your attitude toward school?” mean
essentially the same thing. In our treatment we shall use “attitude” where it is
the usual term and “interest” where usual, but the means of measurement
presented are equally applicable to both.

Dimensions 

The more important dimensions of attitudes-interests are

1. The things to which feelings attach, 

2. The valence of these feelings, and 

3. The intensity of the feelings. 

Among the categories of feeling-toned things important to teachers seem
to be the following:

School subjects and important parts thereof, such as grammar in English and
dates in History

Teachers

School and its components

Entities or abstractions having cultural significance:

Science 

Religion 

Country 

War 

Freedom 

Work 

Minorities 

Etc. 





Reading matter 
Recreational activities 
Occupations and their aspects 


Measuring Procedures for Attitudes and Interests 

Measuring attitudes and interests involves the determination for any pupil
of the things that arouse feeling, the intensity of the feeling toward them, and
its valence or direction. If the things are known or are not themselves in
question, the procedure has to do only with determining valence and intensity.

SELF-REPORTING

In Figure 57 are several illustrations of guided response items used to
measure attitudes and interests through self-report procedures. The first
sample is possibly the most common type in use. In effect, it asks the pupil
to rate the degree and direction of his feelings toward outdoor recreation. The
gradations of feeling may be increased or decreased, the manner of statement
may be varied, and the item is applicable to anything for which a pupil can or
will express his feelings.


1. I like to play outdoor games and sports.

   ____ Always   ____ Frequently   ____ Seldom   ____ Never

Etc.


Yes   No   1. I would rather listen to a band than play in one.

Etc.


Put YES by the things you like and NO by the things you dislike.

____ Swimming                ____ Adventure
____ Poetry                  ____ Romance
____ Movies                  ____ Outdoor things
____ Comic books             ____ Puzzles
____ Checkers                ____ Mysteries
____ Canasta                 ____ Animal life
____ Crossword puzzles       ____ Cowboys
Etc.                         Etc.


Figure 57. Examples of guided response items used in getting pupils to report
on their own interests and attitudes.






The “Yes-No” item following is again widely applicable but any given
item measures direction only, not intensity of feeling. Intensity can be gauged,
though, by having several items imply different degrees of feeling about the
same thing. Another way is to consider that items indicate equivalent intensity
but to assume that the more items are marked in a given way, the more
intense are the pupil’s feelings in that direction.

The third sample is an inventory or check list of interests and aversions. 
Feeling-toned objects and the direction of feelings are measured but not
intensity.

Total scores may be derived for such guided response self-report
instruments. For items of the first type, the degrees of feeling may be weighted
1, 2, 3, 4; -2, -1, +1, +2; or whatever is convenient. A summation may be
made of the weights any pupil has indicated by the way he has marked the
items. The second type yields a total “Yes” score and a total “No” score or,
by counting Yes as 1 and No as zero (or the reverse), a single score can be
obtained that means all No at one extreme and all Yes at the other. If yes’s
and no’s are keyed variously to positive and negative directions, a key must
be prepared and the opinionnaire scored in the fashion of a true-false test
(see page 98). Total scores for the third procedure may be obtained by
counting the checks given to things in the same category. If the items in the
list constitute a single homogeneous grouping, the total number of checks is
the score.
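
The three scoring procedures just described might be sketched as follows. The function names, the response labels, the particular weights, and the category assignments are assumptions made only for illustration; any convenient weighting would serve equally well.

```python
def weighted_total(responses, weights):
    """First type: sum the weights of the degrees of feeling a pupil marked."""
    return sum(weights[r] for r in responses)

def keyed_yes_no_score(responses, key):
    """Second type: count responses that agree with a prepared key."""
    return sum(1 for r, k in zip(responses, key) if r == k)

def checklist_scores(checked, categories):
    """Third type: count the checks given to things in the same category."""
    totals = {}
    for item in checked:
        category = categories[item]
        totals[category] = totals.get(category, 0) + 1
    return totals

# Hypothetical weights for four gradations of feeling.
weights = {"Always": 2, "Frequently": 1, "Seldom": -1, "Never": -2}
total = weighted_total(["Frequently", "Never", "Always"], weights)

# Hypothetical key: the "positive" answer differs from item to item.
keyed = keyed_yes_no_score(["Yes", "No", "Yes", "Yes"],
                           ["Yes", "Yes", "No", "Yes"])

# Hypothetical grouping of inventory items into categories.
categories = {"Swimming": "outdoor", "Checkers": "games",
              "Crossword puzzles": "games", "Poetry": "reading"}
by_category = checklist_scores(["Swimming", "Checkers", "Crossword puzzles"],
                               categories)
print(total, keyed, by_category)
```

In each case the result is a simple count or sum, which is why such totals are best read as rough indications of rank rather than as scale measurements.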

Guided response self-reporting instruments are amenable to preparation
and use by teachers for the measurement of attitudes and interests. In
addition to adhering to general rules for instrument construction (see Chapter
6, pages 90-118), it is well to observe the following cautions.

1. Ask only for feelings of which pupils are likely to be conscious. 

2. Expect no meaningful results if items relate to areas of strong and
stereotyped group feeling, i.e., sex, religion, morality.

3. In extracting total scores, add only the items which refer to the same
category of object and to the same valence of feeling (unless differences in
valence are accounted for by weighting).

4. Rank numbers or classification symbols are the only derived scores 
appropriate to self-reports of attitudes and interests. 

5. Key sufficient items to the attitude or interest object in question to
appraise it fairly. While no general rule is possible as to the minimum number
of similarly keyed items needed, certain considerations have broad
applicability. Complex feeling-toned objects need more items than simple ones
(i.e., interests in reading would require more items than attitude toward a
given book). Guided response procedures usually are less reliable for the
measurement of feeling than for the measurement of achievement, hence longer
instruments are indicated to compensate for this. If a given object for which





feelings are to be gauged has several distinct parts or phases, one or more
items should be keyed to feelings toward each of these.

6. A pupil’s responses to each item in themselves constitute descriptive
data about his feelings and total scores often may neither be desirable nor
particularly meaningful.

Less precise numerically but as valid potentially, measures of interest
and attitude may be obtained by free-response self-reports and by interview.
These procedures will produce only descriptive or at best classificatory data
and they are likely to suffer from incomplete coverage of any attitude or
interest area. Pupils’ free writing and talking usually is unsystematic and the
first things thought of are likely to be exploited at the expense of others
equally important in the pupil’s life. A series of leading questions for writing
and a carefully structured interview can overcome some of this weakness.

TESTS OF OPINION

In self-reporting the approach is direct: “What have been and what are
your attitudes?” The second method of use to the teacher is less direct: “What
are your opinions about certain things? From these I may infer the nature
of your feelings past or present.”

In Figure 58 the three most widely used types of guided response item are
shown as applied to the measurement of opinion: true-false, multiple-choice,

Yes   No   1. Teachers are more strict than other people.


1. If I were on the staff of a newspaper I would prefer to
   a. Write a humor column
   b. Write editorials on social issues and politics
   c. Be the crime reporter
   d. Write feature stories on interesting people and unusual events


Suppose that you knew people in each of the following occupational
categories. What relationships would you like to have with them? Indicate
this by matching one or more numbers from the right-hand column
with each occupation in the left-hand column.

____ Lawyer                   1. Perform a service for you
____ Bus driver               2. Speak to in a friendly way
____ Teacher                  3. Invite to your house
____ Sailor                   4. Ask to join your club
____ Store clerk              5. Be your best friend
____ Loan shop operator       6. Marry
Etc.


Figure 58. Examples of guided response items used to measure pupil opinion.



and matching. The only difference between their application here and to 
achievement (see pages 106-109) is that pupil responses purportedly are 
based on feelings rather than on knowledge or skill. As with their use in 
achievement, each item must have definite keying. In this instance, the keying 
is to a given feeling-toned object, to the direction of that feeling and, perhaps, 
to its intensity as well. 

Agree-Disagree Items. The way in which the sample items in Figure
58 are keyed may serve to illustrate the process. The first item is keyed to a
pupil’s attitude toward teachers. The assumption is made that pupils who
dislike teachers are apt to magnify their strictness. Consequently, an item is
devised for agreement or disagreement which expresses an exaggerated view
of the relative strictness of teachers. A yes response to the item means that
the pupil agrees with the exaggerated opinion and, hence, according to the
assumption, dislikes teachers. A no response means that the pupil rejects the
extreme view and, hence, probably does not dislike teachers. This and other
yes-no items are keyed to object and valence but not to intensity of feeling. If
many items are keyed to the same thing, total scores can be considered
indicative of intensity just as were total scores for yes-no self-report items.

Paired Comparisons. The second of the items in Figure 58 illustrates 
comparison keying. Here, the pupil is required to compare his feelings for 
each of the four activities stated as options. His choice presumably indicates 
that he likes the activity chosen better than any of the others. Each of the 
activity options represents or is keyed to a different type of interest category;
thus, the option that a pupil elects is indicative of his predominant interest
category. In the illustrative item these might be a. fun, b. ideas,
c. excitement, d. people.

Many standardized interest and value tests employ this type of item. Gen- 
erally, they force many comparisons of each interest or value in question with 
every other. Hence, scores may be indicative of intensity of feeling as well as
valence and object, intensity being inferred from the number of things over
which the thing in question is preferred.
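
A brief sketch may make the inference of intensity from forced comparisons concrete. Here every interest is paired once with every other and each preference is counted. The function name and the stand-in "pupil" (a fixed hidden ranking used to answer each comparison) are hypothetical; in practice the choices would come from the pupil's marked paper.

```python
from itertools import combinations

def preference_counts(options, choose):
    """Pair each option with every other and count how often each is preferred.

    choose(a, b) must return whichever of the two the pupil prefers.
    """
    wins = {option: 0 for option in options}
    for a, b in combinations(options, 2):
        wins[choose(a, b)] += 1
    return wins

# Hypothetical pupil whose preferences follow a fixed order, strongest first.
ranking = ["people", "fun", "ideas", "excitement"]

def pupil_choice(a, b):
    return a if ranking.index(a) < ranking.index(b) else b

wins = preference_counts(["fun", "ideas", "excitement", "people"], pupil_choice)
print(wins)
```

An interest preferred over all three of the others receives the highest count, and one never preferred receives zero, so the counts order the interests by inferred intensity.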

Matching Items. The last item in Figure 58, the matching one, is a 
particularly flexible and efficient type. Several objects of feeling can be equated 
with as many or more expressions that indicate directions and/or intensities 
of feeling. In the one shown, an example of the so-called social distance device 
first conceived by Bogardus, the testee is asked to state the relationship he
would be willing to have with persons in different occupations. Each
relationship option varies in intimacy; it is assumed that more intimate
relations are offered only to persons toward whom one has strong positive
attitudes; therefore, the item is keyed to valence and intensity of attitudes
toward occupations. The matching type of attitude-measuring item, particularly
the “social
distance” variant, frequently is employed by sociologists in surveys of the 
prestige or status value of different occupations, nationalities, religions, etc. 





Selecting Items. The selection of items for opinionnaire measurement 
of attitudes and interests is more complicated than the selection of self-report 
items. There it is simply a matter of asking efficient questions about how the 
pupil feels or has felt. In the present case, however, stereotypes of opinion 
must first be determined that are thought to represent various feelings and 
items must then be phrased to reveal which stereotype any respondent prefers. 

The sources of the stereotyped opinions are groups of people who may be 
assumed to have given feelings. For example, it could be assumed that truants 
dislike school and that honor pupils like it. The typical statements of opinion
made by each group about school would then be the stereotypes from which 
test items might be devised. 

Scoring Tests of Opinion. Guided-response tests of opinion are scored
much as are guided response tests in general (see page 98). A key is needed, 
and the papers are marked according to the key. Rather than being correct or 
incorrect, options are indicative of different valences and intensities of feeling. 

If attitude or interest in only one thing is being measured, a single score 
may be derived by counting only those responses that display a given feeling 
toward the thing in question. The first example in Figure 58 is a type of item 
often found in interest-attitude tests with such a single focus of measurement, 
yielding a single score. Pupil responses to a number of these items would be 
credited or not according to the key and those credited added to give the 
pupils’ scores. Thus, in a test containing items like the one in the illustration, 
a pupil might have a score of 27 for his attitude toward school, another might
have 35, and a third, 40. This should mean that the third pupil liked school 
better than the second, the second better than the first. 

The second item in Figure 58 is a sort that is often used in tests with a
multiple focus of measurement. Through one test there is an attempted
appraisal of feelings about many things. Obviously, single scores are obviated
for such tests and as many scores must be obtained as there are different
feeling-toned categories involved. The illustrated item might be from a test of
“predominant interests” and a separate score should then be obtained for each
category of interest involved in the test. Each item might or might not have
an option keyed to each category, but all the responses that favored “fun”
would have to be added, all that represented “ideas,” “excitement,” “people,”
and other categories would have to be added and the score for one pupil’s
paper would be a series of numbers, as for example,


Fun            1[?]
Ideas          17
Excitement      4
People         20
Etc.





Such scores would be interpreted to mean that a pupil’s interests were 
more in the areas having high scores than in those having low scores. The 
scores used as an example show that the pupil was slightly more interested in 
people than in fun or ideas and was little interested in excitement. 
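
Scoring a test with a multiple focus of measurement amounts to tallying each response under the category to which the chosen option is keyed. The sketch below assumes a hypothetical key in which each item maps its option letters to interest categories; the names and the three sample items are illustrative only.

```python
def category_scores(responses, key):
    """Tally one pupil's chosen options under the category each is keyed to.

    responses: the option letter chosen for each item.
    key: for each item, a mapping from option letter to interest category.
    Options with no keyed category are simply not counted.
    """
    totals = {}
    for chosen, options in zip(responses, key):
        category = options.get(chosen)
        if category is not None:
            totals[category] = totals.get(category, 0) + 1
    return totals

# Hypothetical key for three items; not every item offers every category.
key = [
    {"a": "fun", "b": "ideas", "c": "excitement", "d": "people"},
    {"a": "people", "b": "fun", "c": "ideas"},
    {"a": "ideas", "b": "people", "c": "fun", "d": "excitement"},
]
scores = category_scores(["d", "b", "b"], key)
print(scores)
```

The result is a series of numbers, one per category, to be read comparatively in the manner just described.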

Significance of Scores. Because the scores yielded by tests of opinion are 
based on items whose equivalence or relative significance is indeterminant, 
they may be considered indicative of rank order only. Moreover, the rank 
indexes they yield necessarily have a relatively high standard error of score 
(see page 171). Hence, conclusions should be based only on wide differences
in rank and even then conclusions must be considered tentative and gross,
much less dependable than those based on achievement and intelligence tests.
The extent of rank differences needed for conclusions about interests and
attitudes is shown in the Kuder Preference Record (Science Research
Associates) where only scores above the 75th percentile and below the 25th
percentile are deemed significant of interest on the one hand and aversion on
the other.
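
The convention of drawing conclusions only from extreme ranks can be sketched as below. The 25th and 75th percentile cutoffs come from the text; the mid-rank formula for percentile rank, the function names, and the norm-group scores are assumptions chosen for illustration and are not taken from the Kuder manual.

```python
def percentile_rank(score, norm_scores):
    """Percent of the norm group below the score, counting ties as half.

    (Assumed definition; other percentile-rank conventions exist.)
    """
    below = sum(1 for s in norm_scores if s < score)
    equal = sum(1 for s in norm_scores if s == score)
    return 100.0 * (below + 0.5 * equal) / len(norm_scores)

def interpret(score, norm_scores, low=25.0, high=75.0):
    """Draw a conclusion only when the rank is widely removed from the middle."""
    rank = percentile_rank(score, norm_scores)
    if rank > high:
        return "interest"
    if rank < low:
        return "aversion"
    return "no conclusion"

# Hypothetical norm group of twenty scores, 1 through 20.
norms = list(range(1, 21))
high_result = interpret(18, norms)
middle_result = interpret(10, norms)
low_result = interpret(3, norms)
print(high_result, middle_result, low_result)
```

Scores falling in the broad middle band yield no conclusion at all, which reflects the relatively high standard error of such rank indexes.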

Sometimes measures are not desired for each pupil’s opinions but rather
for group opinion with regard to certain things. While the third sample in
Figure 58 is useful for measuring a given pupil’s attitudes, it is also
appropriate for measuring group attitude toward occupations. Group
attitude is the determiner of prestige so, in effect, this social distance measurer
can gauge the prestige value of various occupations. Scoring papers for this
purpose involves tallying the way all the tested persons respond to each item,
rather than how each person responded to all the items. We derive
occupational scores rather than pupil scores. This type of scoring can yield
information of value about books, games, assignments, etc., as well as
occupations.
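
Scoring by item rather than by pupil can be sketched as a tally across all papers. The sketch assumes each paper is recorded as a mapping from occupation to the relationship numbers the pupil matched with it, and that higher numbers denote more intimate relationships; the median of each pupil's closest relationship is an assumed summary chosen because it suits rank-order data, not a procedure prescribed in the text.

```python
from collections import Counter
from statistics import median

def tally_by_occupation(papers):
    """Count, for each occupation, how often each relationship was chosen."""
    tallies = {}
    for paper in papers:
        for occupation, chosen in paper.items():
            tallies.setdefault(occupation, Counter()).update(chosen)
    return tallies

def median_closest(papers):
    """Median, across pupils, of the most intimate relationship offered.

    A pupil who offers no relationship at all is counted as 0 (an assumption).
    """
    closest = {}
    for paper in papers:
        for occupation, chosen in paper.items():
            closest.setdefault(occupation, []).append(max(chosen, default=0))
    return {occupation: median(values) for occupation, values in closest.items()}

# Hypothetical papers from three pupils.
papers = [
    {"Lawyer": [1, 2, 3], "Sailor": [1]},
    {"Lawyer": [1, 2], "Sailor": [1, 2]},
    {"Lawyer": [1, 2, 3, 4], "Sailor": []},
]
tallies = tally_by_occupation(papers)
prestige = median_closest(papers)
print(tallies, prestige)
```

The same tallying applies unchanged if the left-hand column lists books, games, or assignments instead of occupations.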

It is expected that teachers can and will devise their own tests of opinion
to gauge pupil attitudes and interests. Many well-designed standardized tests
are available, of course, and these may be employed whenever they suit the
instructional or guidance needs of a particular school situation. The
observations just presented may be helpful in the construction of tests of
opinion, as may be those made for self-report instruments since the two have
much in common. In addition, it is well to pay strict attention to general
rules for the construction of guided response instruments (see pages 90-118).
In devising and using tests of opinion it must be assumed that they are less
valid and reliable than the tests of achievement that the same teacher has
devised. Because of their susceptibility to insincere and false answers, no
conclusions should be drawn without corroborative evidence.

Standards in Attitudes and Interests 

As a rule, marks are not assigned to pupils’ attitudes and interests, even 
when they relate to a given subject, but only to achievement in that subject. 
Hence, evaluative standards have little practical significance for teachers. The 
one major exception to this generalization is in the case of vocational interests. 





Vocational interests must be evaluated because they are necessary factors 
in vocational preparation and the secondary schools have undertaken some 
degree of vocational preparation for youth. The applicable standards are, of 
course, the status of interests for successful practitioners of different
vocations and the interests necessary to enjoy or at least to tolerate the
training necessary for the vocations. The former have been formalized in the
norms of published standardized tests and are applied automatically in the
testing process. The latter are nearly all subjective standards, ideas in the
minds of teachers
representing their experiences and surmises. 


EVALUATING CHARACTER AND CITIZENSHIP 

The other aspect of the personality-character phenomenon for whose 
measurement teachers have particular concern is the matter of character 
itself. It used to be the term used to denote the total individuality or
characteristic state of being for a given human, but for this broad
signification it has largely been replaced by the word personality. As we
stated earlier, character now is used most often to refer to that which other
persons evaluate from a moral and ethical point of view. Its dimensions,
therefore, have no necessary
psychological import but are simply the terms applied to aspects of behavior 
that may be judged desirable or undesirable, right or wrong, good or bad. 

Dimensions of Character or Citizenship 

Some of these attributes were listed in Table 31: co-operation, critical
thinking, friendliness, honesty, leadership, neatness, obedience, self-respect,
sportsmanship, sympathy, and tolerance. Others could be added almost ad
infinitum and no listing may be considered definitive. What the words
symbolize may overlap and they represent neither a logical nor a psychological
system. But they are qualities that parents and teachers want children to
acquire or to exhibit, and what they stand for affects the welfare of the class
and the school.

Citizenship as used in most school situations is a counterpart, if not a 
synonym, of character. Actions with respect to other pupils and adults and 
with respect to the legal and ethical norms of society seem to be the referents
of citizenship so there is little difference in operational definition between the
two. Character may relate more to what underlies ethically judged behaviors 
and citizenship, somewhat more to the behaviors themselves, but for our pur- 
poses we shall treat them as one. 

Dimensions of character-citizenship must be selected to suit a particular 
situation. Such factors as honesty, co-operation, and obedience are likely to 
obtain, of course, for nearly all situations, but others less basic may not, i.e., 
tact, altruism, etc. In planning the dimensions of character or citizenship you 
wish to evaluate it is well to consider the following. 





1. The dimensions are abstractions and hence matters of inference. For 
example, “honesty” does not stand for any given action, thing, or process but 
rather for the fact that certain actions seem to have something in common. 
We name the existence of a given common attribute “honesty”; whenever we 
attempt to measure it we may only infer its “existence” from the behaviors we 
observe. 

2. Most of the dimensions likely to be appraised in the name of
character or citizenship are considered to be attributes of behavior inferred to
exist in some degree from zero to maximum. Thus, for the majority of them
the most important subdimension is amount or frequency: how obedient or
how often obedient, when co-operative or how much co-operation, etc.

3. Because they are to be inferred, they should be defined in terms of the
actions or action situations from which given inferences are to be drawn. An 
illustration of such definition is this one for obedience: responds quickly 
to orders, follows directions exactly, anticipates what the teacher wants, and 
acts accordingly. 

4. Because character and citizenship dimensions are qualities of learned
behaviors, age differences are apparent for them. Certain dimensions may be
valid at given ages but not at others; only those dimensions should be
appraised at a given age whose possession is possible for that age pupil. For
example, tolerance is an inappropriate dimension for preschool pupils but
obedience is an appropriate one. Co-operation is hardly a valid consideration
until the third grade or so and such attributes of maturity as altruism, idealism,
and tact may not be significant until the secondary grades.

Forms and Means of Measurement 

Having only inferred and ill-defined dimensions, the character or citizen- 
ship of a pupil may be described, roughly classified, or assigned a rank within 
a group. It cannot be expressed as a scale number and probably the most valid 
measure is simply a description. In this description, status within certain 
dimensions may very well be appraised in terms of a class or rank designation. 

Observation and peer rating are the only procedures currently having much
validity for assessing character or citizenship. Measures based on self-reports
and tests of opinion and knowledge have been found to bear a nearly negligible
relationship to pupil behavior (21:126-134). Product analysis is hardly
pertinent to their measurement and while projective procedures have been
found to be of some use in character analysis, their administration usually is
beyond the resources of teachers.

BEHAVIOR RATING SCALES 

Obtaining and using peer opinion are described in a previous section
of this chapter, pages 399-403, and techniques of observation are treated in
detail in Chapter 4. It should suffice here to explain several devices for rating
character dimensions. Three such are illustrated in Figure 59.





Graphic Rating Scale

Honesty        1            2            3            4            5

          Lies and cheats           Usually is honest          Always is truthful
          whenever it will          but may lie or             and honest even
          help him                  cheat if threatened        when he may be
                                    or frightened              hurt thereby

Modified Man for Man

In appearance and dress most like:

___ 1. Tom, whose hands and face are often dirty, hair unkempt, clothes
       rumpled and soiled, shoes scuffed.

___ 2. Bill, who usually has clean hands and face, combs his hair and has
       pressed clean clothes but who is inattentive to his appearance,
       is apt to wipe his hands on his trousers and to have his shirt tail
       out most of the time.

___ 3. Jack, who always is freshly scrubbed and combed, whose shoes
       shine and whose clothes are in perfect repair and press and who
       takes care not to get dirty or mussed.

Check list

Words checked characterize the pupil in question

___ Courteous        ___ Neat               ___ Irritable
___ Thoughtful       ___ Prompt             ___ Selfish
___ Fair             ___ Obedient           ___ Rude
___ Sympathetic      ___ Rowdy              ___ Disobedient
___ Friendly         ___ Inconsiderate      ___ Late with work
___ Cooperative      ___ Takes advantage    ___ Cheats
                                            ___ Etc.

Figure 59. Examples of three types of devices used in rating personality-character
variables after observation.

Graphic Scales. The form at the top is called a graphic rating scale. It
consists of a line with numbers or subdivisions together with brief descriptions
of how pupils act who possess different degrees or amounts of the dimension
in question. The descriptions correspond with subdivisions, points, or numbers
on the line. Pupils are observed and a determination is made of which descriptive
statement any one most nearly approximates. A check is placed at a point
or number on the line corresponding to the proper description or to a point or
number a proper distance from it. The device is useful for all character-citizenship
dimensions and, if the “graphic” statements are carefully drawn to cover
the variation likely to be observed in the dimension, it is a valid procedure
easy to prepare and to use. A single page can contain a number of dimensions
and provision for their rating.
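
For readers who keep such ratings in a machine-readable form, the make-up of a
graphic scale can be sketched as a small data structure. The following is only
an illustration in Python, not a standard instrument: the class name, the method
name, and the placing of the three descriptions of Figure 59 at points 1, 3, and 5
of the line are our assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GraphicScale:
    """A rating line with numbered points and descriptions anchored at some of them."""
    dimension: str
    points: tuple   # numbered positions on the line, lowest first
    anchors: dict   # point on the line -> description of typical behavior

    def nearest_anchor(self, rating):
        """Return the description anchored closest to a check placed on the line."""
        if not self.points[0] <= rating <= self.points[-1]:
            raise ValueError("rating falls off the line")
        point = min(self.anchors, key=lambda p: abs(p - rating))
        return self.anchors[point]

# Wording from Figure 59; the anchoring at points 1, 3, and 5 is assumed.
honesty = GraphicScale(
    dimension="Honesty",
    points=(1, 2, 3, 4, 5),
    anchors={
        1: "Lies and cheats whenever it will help him",
        3: "Usually is honest but may lie or cheat if threatened or frightened",
        5: "Always is truthful and honest even when he may be hurt thereby",
    },
)

# A check placed between 4 and 5 lies nearest the description anchored at 5.
print(honesty.nearest_anchor(4.6))
```

Because a check may fall anywhere on the line, intermediate ratings such as 4.6
are permitted, which is the feature that distinguishes the graphic scale from
the other two devices.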

Man to Man Scales. The next form, “modified man for man,” is really
a variant of the graphic rating scale. It permits, of course, only specified
ratings and not intermediate ones as does the graphic rating. The form derives
from efforts to measure character or personality by selecting a series of prototype
individuals known to those who are to make the evaluations; the evaluators
appraise other individuals by selecting the prototype which each most
resembles. As usually practiced now and as exemplified, the “man” with whom
any pupil is compared is fictional, the names are there only for verisimilitude,
and it is just a special type of descriptive rating scale. Its special merit lies in
the more detailed description usually given. Its principal flaw is its lack of
intermediate ratings but this can be remedied. Obviously, many more than
three “men” may and often should be used for the rating of a given dimension.

Check Lists. The last device in Figure 59, the check list, is a crude but
widely used means of indicating the attributes most prominent in a given pupil.
No continuous variation can be expressed for any of the dimensions and,
knowing their inferred nature, the device is inherently invalid for this reason.
Variation can be shown by using numbers from, say, 1 to 5 for each factor
rather than a check or the omission of a check. Without some index of variation
for each of the dimensions, the form serves only to record the observer's
gross impressions. It is tantamount to measuring a mountain range by enumerating
the peaks that exceed a given height. You would get an impression
of an extensive or small, a high or low range but that is about all. Just so, a
check list such as is illustrated might distinguish a very pleasant pupil from
a very unpleasant one but hardly more.
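
The loss of information in a check list can be shown with a short sketch in
Python. The pupils, their ratings, and the cutoff at which a trait is taken
to be “checked” are all hypothetical and are chosen only to illustrate the point.

```python
# Hypothetical ratings of two pupils on a 1-to-5 scale (5 = most prominent).
ratings = {
    "Pupil A": {"Courteous": 5, "Friendly": 4, "Prompt": 4, "Neat": 5},
    "Pupil B": {"Courteous": 3, "Friendly": 5, "Prompt": 3, "Neat": 3},
}

CUTOFF = 3  # assumed: a trait is "checked" when its rating reaches this level

def as_check_list(pupil_ratings, cutoff=CUTOFF):
    """Reduce 1-to-5 ratings to checks: only whether each trait reaches the cutoff."""
    return {trait: score >= cutoff for trait, score in pupil_ratings.items()}

checks = {pupil: as_check_list(r) for pupil, r in ratings.items()}

# The two check lists are identical although the underlying ratings differ.
same_checks = checks["Pupil A"] == checks["Pupil B"]
same_ratings = ratings["Pupil A"] == ratings["Pupil B"]
print(same_checks, same_ratings)  # prints: True False
```

Like the count of peaks above a given height, the checks record only that each
trait exceeds the cutoff; how far it exceeds it is lost.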

DESCRIPTIVE RECORDS

Rather than rating pupils’ character dimensions, anecdotal or descriptive
records may be kept. These have the advantage of greater possible attention
to individuality and to variations other than in amount but may lack consistent
coverage of all important dimensions for all pupils. Moreover, purely
verbal descriptions make for difficulty in comparing pupils. The matter of
coverage can be handled by having printed on the record form all the dimensions
to be appraised. However, it is thought that a combination of anecdotal
record and dimension-by-dimension rating procedure should constitute a
more efficient procedure than either used alone.

Evaluative Standards 

Idealized conceptions and developmental norms serve as the usual standards
for evaluations of character and citizenship. Idealized conceptions, as we
have presented them, are statements of things as they should be or of gradations
from some lesser condition to this point. When applied to citizenship or
character, they are stereotyped descriptions of the maximum manifestation
of any trait or this plus stereotypes for lesser conditions (notice the first scale
in Figure 59). Developmental norms, on the other hand, would be the character-citizenship
attributes typical of children of different ages. If the two are
used in conjunction, we have idealized conceptions at each age level based
on the possibilities for children at that age. Such a joint standard is more
useful than either of them alone.

The ideal conceptions and the age modes are stated in some elementary 
grade courses of study and in curriculum manuals, particularly those that 
advocate “unit” instruction.* Child development textbooks contain developmental
norms for many dimensions. More often, though, their determination
is left to the individual teacher and, as a consequence, standards for character 
and citizenship tend to be highly subjective and to vary capriciously from 
teacher to teacher. 

We noted elsewhere in the text (page 191) that measurement and evaluation
often are coalesced in practice and that unwitting error may accompany
the process. No aspects of educational evaluation are more susceptible to this
condition than these we are discussing. In Figure 59, the statements in the
first illustrative rating device imply value as well as describe possible status.
A child who is rated 5 in honesty not only has been “measured,” he also has
been judged to be of high worth, for in our society a high degree of honesty
is considered a virtue. Too, in the check list at the bottom of Figure 59, certain
terms have laudatory connotations and others derogatory. Consequently,
the pupil for whom the good words are checked has been praised as well as
measured and the one receiving checks by offensive words has, in effect, been
blamed. Some mingling of the two operations is close to inevitable because the
measurement symbols are nearly always words and the words applicable to
character and citizenship frequently have evaluative significance.

Consequently, certain precautions should be observed. The registration
of a rating always should follow observations and, if these are many and
occur over an extended period of time, ratings should be based on observational
notes. The rater should be aware that he is evaluating and not just
measuring. The validity of the rating scale as a proper evaluative standard
should be established and not just assumed (see Chapter 9, pages 193-194).

Beset with subjectivity and with the coalescence of evaluation and measurement,
evaluation of character and citizenship encounters another difficulty
in middle-class–lower-class value differences. As we know, the conceptions of
character and citizenship that constitute teachers’ standards tend to be middle-class
in character. On the other hand, a large percentage of pupils in many
classes are from lower-class homes and there character may be judged by very
different standards. Consequently, many pupils are foredoomed to fail in
citizenship simply because parental approval and reproof have been given to
the “wrong” attributes. For example, too frequent fighting usually is judged
by middle-class teachers to be a sign of poor citizenship. Yet, in many lower-class
groups, ability and willingness to fight are highly prized.

* For a detailed examination of citizenship evaluation as it is practiced in the schools
of one state, see “Evaluating Pupil Progress,” California State Department of Education
Bulletin, XXI, No. 6, April, 1952.





Summary

The dimensions of personality and character are such things as attitudes, 
interests, fears, beliefs, and values; the variables of personality structure ac- 
cording to a given theory (dominance-submissiveness, masculinity-femininity, 
for example); and various attributes of character (honesty, neatness, obedi- 
ence, etc.). Teachers normally are directly concerned only with the measure- 
ment of attitudes and interests with educational implications and the evalua- 
tion of certain character attributes, or citizenship. 

For attitudes, the more appropriate measuring procedures are observation
and the use of rating scales; self-report, either through discourse, autobiography,
or some guided response device; and guided response tests of opinion.
Judgments about the value of given attitudes usually are based on group
norms, on the idealized conceptions of sociologists, psychologists, and
teachers, and on age expectancies as described by child development specialists.
The measurement and evaluation of interests proceed in much the same
way, with the addition of the interests of successful practitioners as the standard
for evaluating vocational interests.

Fears, beliefs, and values are approachable through the procedures of 
observation, self-report, and opinion expression and, to some extent, through 
product analysis and projective testing. The latter procedures are important 
for clinical examination of personality structure. In such examination inter- 
views and case studies also are basic procedures. Group tests have little use 
in the analysis of personality structure save for purposes of preliminary screening.
The evaluation of fears, beliefs, values, and other variables of personality
structure is based on a variety of standards, principal among them being
Judaeo-Christian ethics; community, state, and national customs; and books
by professional and nonprofessional persons about “the happy and productive
personality.”

Character-citizenship dimensions have a psychological significance different
from the dimensions cited above. They have to do with reputation as
much as with being. So, in addition to the procedures described, they may
be evaluated through analysis of the feelings of peers about one another.
Sociometry is the term generally applied to this type of measurement and
sociograms, “guess who” techniques, and straight peer ratings are among its
devices.


EXERCISES

1. Differentiate among the following basic procedures for measuring personality
dimensions as to validity, ease of use, and pertinence to particular dimensions:

Observation of behavior
Analysis of products
Self-reporting questionnaires
Tests of opinion
Projective tests
Sociometry

2. Inspect several published group personality tests and write a critique of 
each in terms of its susceptibility to insincere and deceptive answers. 

3. Prepare a sociogram for a group of children or youth. 

4. Prepare a plan for evaluating the citizenship of pupils at the grade level in 
which you specialize. 

5. Construct a descriptive or graphic scale for rating honesty, co-operativeness,
and aggressiveness.

6. Devise a form that a guidance official might use for case studies of “prob- 
lem” pupils. 



CHAPTER 16 


SCHOOL-WIDE TESTING PROGRAMS 


So far in this section we have dealt with the evaluation of pupil achieve- 
ment in given subjects or with the measurement of pupil intelligence, interests, 
citizenship, etc. In the presentation, attention has been focused on the individ- 
ual teacher and the pupils he may teach. Now in this concluding chapter of 
the section and the book, we wish to shift our focus to school-wide uses of 
measurement and evaluation. 

It currently is common in American schools and school systems to have
an integrated and centrally administered program of testing. The reasons for
this vogue seem to be twofold. In the first place there happens at last to be
the necessary wherewithal for such programs. The scientific movement in
education has provided schoolmen with some essential tools and insights:
for example, “normal” probability tables, statistical procedures and symbols,
mechanical and electronic scoring devices, and accurate concepts of individual
differences, the relationship between achievement and mental maturity,
and the significance of pupil interest and adjustment. Standardized tests are
being published in great numbers, in wide variety, and in constantly improving
quality. Moreover, the education of teachers is such now as to produce more
and more teachers with skill and interest in testing. And, perhaps the most
important factor of all, graduate schools are training psychometric, guidance,
and research specialists who are capable of planning and administering testing
programs. In the second place, most elementary and secondary schools are
trying to do things that are facilitated if not made possible by school-wide
testing. Among these are personal and vocational guidance, differentiated
grouping within classes or among classes, determination of the mean achievement
of pupils in given grades and subjects, and controlled experimentation
with methods and materials.

In our study of testing programs we shall deal first with the phenomena 
usually tested. Then we shall discuss the several procedures and processes the 
program may entail. After this, attention will be given to the various applica- 
tions or uses to which the results are to be put. Finally, some general tenets 
will be given for efficient testing programs. 

Focal Points of School-Wide Testing 

The standardized tests given to pupils in school-wide programs usually
are directed toward measuring intelligence, reading ability, achievement in
basic subjects, vocational interests, special aptitudes, and extremities of maladjustment.
A few systems may undertake more than these and many are
concerned with only a few of them. Intelligence probably is tested with the
greatest frequency, with reading in the second position. Special aptitudes and
detection of maladjustment are the factors most likely to be exempted from
a program.

Intelligence, reading, basic subjects, and vocational interests we have dis- 
cussed earlier. The special aptitudes of primary concern to schools are those 
for mechanical, clerical, artistic, and especially musical pursuits. Evaluation 
of the latter is becoming a commonplace prelude to instrumental musical 
instruction in elementary and junior high schools. 

Systematic attempts to detect present or incipient extreme maladjustment
are accompaniments to the schools’ concern over mental hygiene. During the
last fifty years or so, in fact since there have been scientific explanations for
disturbed and asocial behavior, teachers and administrators have faced two
unpleasant realities. Many “bad” pupils, truants, vandals, thieves, and even
the “shy” and the socially erratic but academically brilliant may have personality
disorders and not just ill will toward teachers. Some cases of educational
difficulty (reading disabilities, speech faults, test fright, etc.) may
originate in or be aggravated by a neurotic condition as well as by low intelligence
or low motivation. As schoolmen have recognized these relationships,
they have seen the desirability of screening pupils for maladjustment just as
they have grown accustomed to checking for reading readiness. Because few
reliable group tests of maladjustment have been devised, not many elementary
and secondary schools include the item in their testing programs.

Instruments and Procedures in School-Wide Testing 

The instruments basic to a school or system testing program are published
standardized tests. These may be tests of specific phenomena (intelligence,
arithmetic achievement, etc.) or they may be batteries. The usual components
of batteries are the fundamental elementary school subjects of reading, arithmetic,
language, social studies, and science. In secondary schools there is less
occasion to use test batteries but several are available, covering, as a rule,
the traditional high school subjects: grammar, literature, history, algebra,
physics, and chemistry (see Chapter pages 121-126, for the general characteristics
of standardized tests and Appendix B for test titles).

Of necessity, most testing must be done in groups and tests designed for
group use are the mainstay of any program. However, the more efficient programs
provide for some individual testing as well. Certainly for intelligence,
reading, and deviate behavior the results of mass testing often may need to be
confirmed or amplified. As a rule, individual tests are not used automatically
but only on referral. For intelligence, the Binet and more recently the Wechsler
Intelligence Scale for Children are standard individual tests. Behavior disorders
may be diagnosed by the Rorschach, the TAT, or a few other projective
instruments. Several good diagnostic reading tests are on the market,
among them the Durrell Analysis of Reading Difficulty and the Gates Reading
Diagnostic Tests.

Obtaining and administering standardized tests for even one school of any 
size is a complicated undertaking. For a multischool district, a populous
county, or a metropolitan school system it usually is a full-time, continuous 
activity for one or more persons. In procuring and using the tests certain 
factors need to be considered: selection, ordering, manner of administration, 
manner of scoring, handling results, and cost. While variation among schools 
as to purposes and personnel precludes any exact prescription as to each of 
these, efficiency may be increased by adherence to a few general principles. 

Selection of Tests. Publishers' catalogues furnish the most current listings
of tests together with information as to cost, ordering, norms, and administration
time. They do not, as a rule, say much about the test’s validity, reliability,
or standardization population. The latter information is critical, of
course, in assessing the merit of any test. The most complete single source
of validity data, etc., is Buros’ Mental Measurements Yearbook. The periodical,
Educational and Psychological Measurement, may contain reviews of
given tests, and journals in specific fields (e.g., The English Journal) often
report critically on new tests in their fields. If specimen sets of tests are ordered
prior to selection (and they always should be), the test manual may indicate
what indexes of reliability and validity have been obtained in the test.

Technically adequate standardized tests are coming to be the rule whereas 
once they were the exception. Consequently, any school should insist that 
any test it orders have certain essential features:

1. A manual giving complete information as to standardization popula- 
tion, dimensions which the test may measure, norms, administration, scoring, 
reliability, and validity. For achievement, intelligence, and reading tests, it 
seldom is necessary to accept reliability coefficients of less than .90. Discus- 
sions of validity should be frank (any test is suspect that fails to discuss its 
validity, or that makes unsupported assertions about its validity) and should 
involve some logical analysis as well as statistics about correlations and item 
discrimination. 

2. Record forms that allow for easy presentation of part scores. A graphic
display called a profile sheet is the most common form. 

3. Simple scoring keys for both manual and machine scoring, if the latter 
is possible. 

4. Provision for machine scoring if the tests are entirely guided response
select-an-answer types.

5. Separate answer sheets so that test booklets may be reused. If machine 
scoring is possible, both machine answer sheets and manual answer sheets 
should be provided. 

6. Time of administration at least ten minutes less than the length of the 
period in which the test is to be given. As long as pupils are not fatigued, 
longer tests tend to be more reliable than comparable short ones. 
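
Several of the features listed above are definite enough to be checked
mechanically when specimen sets are compared. The following sketch in Python is
one way this might be done; the class, its field names, and the candidate test
are hypothetical, and only a part of the six features is covered.

```python
from dataclasses import dataclass

@dataclass
class CandidateTest:
    """Facts about a published test, as gathered from its manual and specimen set."""
    title: str
    kind: str                      # "achievement", "intelligence", "reading", ...
    reliability: float             # reliability coefficient reported in the manual
    validity_supported: bool       # manual discusses validity and supports its claims
    has_profile_sheet: bool        # record form that displays part scores
    has_separate_answer_sheets: bool
    minutes_to_administer: int

# Kinds of test for which the text expects reliability of .90 or better.
HIGH_RELIABILITY_KINDS = {"achievement", "intelligence", "reading"}

def shortcomings(test, period_minutes):
    """Return the essential features, among those checked here, that a test lacks."""
    problems = []
    if test.kind in HIGH_RELIABILITY_KINDS and test.reliability < 0.90:
        problems.append("reliability below .90")
    if not test.validity_supported:
        problems.append("validity not discussed or not supported")
    if not test.has_profile_sheet:
        problems.append("no profile sheet for part scores")
    if not test.has_separate_answer_sheets:
        problems.append("no separate answer sheets")
    # The test should take at least ten minutes less than the period allows.
    if test.minutes_to_administer > period_minutes - 10:
        problems.append("administration time too close to length of period")
    return problems

# A hypothetical test considered for a 50-minute class period.
candidate = CandidateTest("Hypothetical Arithmetic Test", "achievement",
                          0.86, True, True, True, 45)
print(shortcomings(candidate, period_minutes=50))
```

Such a screen does not replace a reading of the manual; the frankness and logic
of a validity discussion, in particular, remain matters of judgment.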





Ordering. Each publisher states his own preference and rules for ordering
but the following obtains for most publishers. Five and sometimes 10 per
cent discounts are allowed on quantity orders. Ordering well in advance of
need and stipulation of date of use will insure that the tests arrive on time. If
there is doubt as to just what test materials are needed, describe to the publisher
what use is intended for the test and he usually will be able to send what is
necessary.

Manner of Administration. The sine qua non of test administration is
to secure maximum rapport on the part of pupils. Usually this is easier to
obtain by having the regular teacher administer the test than by using a testing
specialist. In addition, though, directions must be followed exactly, emergencies
must be handled, and timing should be punctual, so it may be necessary
to have brief training sessions for teachers who are to administer the
tests and even to exclude certain teachers who seem to find test administration
especially difficult.

Use of regular classrooms (if they are amenable to a good testing situation)
is considered preferable to use of auditoriums or gymnasiums. While
mass arrangements require fewer test proctors, they tend to increase the possibility
of pupil misdirection and inattention.

Manner of Scoring. Scoring tests is a clerical task and a tedious clerical
task at that. The advantage usually cited for teachers scoring tests is that they
will note the errors of pupils and make diagnostic use of the test. Equal opportunity
to diagnose a pupil’s difficulty is provided by returning scored answer
sheets to teachers. In fact, rapid use of a key nearly precludes any notice of
who missed what. Moreover, insistence that teachers score their own standardized
tests has led in many instances to negative teacher attitudes toward the
tests and even to diminished usage.

Hence, in the authors’ view, tests should be scored first by machines or 
by self-scoring stencils. If this is not possible, they should be scored manually 
by clerks. Only if neither of these is possible should it be necessary for teachers 
themselves to score the tests.

Handling Results. Results of standardized testing should be transmitted
as rapidly as possible to the persons who will make use of them. The longer
the delay between test administration and knowledge of results, the less significance
will attach to them. Even if the testing is for some purpose that does
not involve the teacher, the teacher to whose pupils a test was administered
will, as a rule, wish to know their scores and would seem to have some sort of
“moral right” to the information.

It is advisable to consider that scores on standardized tests are confidential 
data: emphatically so for intelligence and personality tests. Only professional 
persons should have access to them. Obviously, student clerks should not 
handle them and parents likely will find a verbal interpretation more meaningful
than a standard score or a percentile.

As to what the pupil himself should know of his standing on any standardized
test, it is thought that his welfare should be the criterion. If knowing a
score or having it interpreted will help him, he should be informed. If it will
not, if it will frighten him, confuse him, make him feel loss of face, etc., he 
should not be informed. On this basis, pupils probably should not be told 
their Mental Age or IQ. They certainly should be told the rate at which they 
can read and be given some meaningful index of their reading comprehension, 
their spelling ability, their arithmetic skill, and of any item involved in the 
teacher’s regular day-by-day evaluations. Probably, pupils should not be 
given the results in terms of grade placement or percentile rank in a general 
population. 

Cost of Standardized Testing. The cost of tests and their administration
obviously will vary according to the number and kind used and the extent to
which special staff or released time is provided for the testing program. But
without question the cost of a standardized testing program is trivial in comparison
to the total cost of instruction. For example, to give a general achievement
test (reading, arithmetic, vocabulary, grammar, and spelling) and an
intelligence test to one pupil four times during a twelve-year period and to
administer a vocational interest test once would cost in materials only $1.65.¹
If an individual intelligence test is administered, this might cost $15.00, but
most pupils would not be referred for an individual intelligence examination.
These costs are to be compared with the over-all instructional cost of educating
a child for twelve years, which is about $4,000 in California. In many
states the figure is less but in some it is more.

Hence, we feel that cost should not be a critical factor in determining
what shall be tested and what tests shall be used. Revised forms of tests should
be ordered if current tests are dated or faulty even though old booklets are
still usable. A less reliable test should not be chosen over a more reliable one
simply because it happens to be cheaper. If on educational grounds it is desirable
to include six things in the testing program rather than four and to
have biennial testing rather than triennial, there should be little hesitation just
on the ground of increased cost. In our example, use of two tests at three-year
intervals and one once during the twelve-year period required only $1.65
worth of materials per pupil. If this were increased tenfold, the cost would
be only $16.50 for twelve years, or but $1.38 a year. And this presumes a
new test booklet for each pupil for each administration, which of course
would not be necessary.²
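
The arithmetic of this example is easily verified. The short sketch below, in
Python, uses only the figures given in the text ($1.65 in materials per pupil
over twelve years and about $4,000 for twelve years of instruction in
California); the variable names are ours.

```python
# Figures taken from the text; variable names are illustrative.
materials_cents = 165          # $1.65 in test materials per pupil over twelve years
years = 12
instruction_dollars = 4000     # approximate twelve-year cost of instruction per pupil

tenfold_cents = materials_cents * 10            # the tenfold increase supposed in the text
tenfold_dollars = tenfold_cents / 100
per_year_dollars = tenfold_dollars / years
share_of_instruction = tenfold_dollars / instruction_dollars

print(f"Tenfold materials cost: ${tenfold_dollars:.2f} for {years} years")
print(f"Cost per year: about ${per_year_dollars:.3f}")
print(f"Share of instructional cost: {share_of_instruction:.2%}")
```

Even at ten times the outlay of the example, the yearly figure of $1.375 (the
$1.38 of the text) is well under one per cent of the cost of instruction.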

¹ California Test Bureau, 1956 catalogue. Other tests would cost a little more or
less.

² Sacramento City Unified School District, California, estimates the cost of its standardized
testing program at 35 to 40 cents per year per pupil.

Locally Devised Tests. A school or system with one or more trained
psychometrists on its staff may wish to devise its own tests. The chief advantage
of this is that the tests may then be tailored to fit exactly the curriculum,
the pupils, and the purposes of the school’s testing program, whereas
commercially published tests, being designed for general use, are only approximately
suited to the program in any given school. The disadvantages of
locally designed tests are several. As a rule they will not yield scores that allow
comparisons with a general population of pupils. Because they do not have
to survive in the market place, they are likely to have more technical imperfections
than published tests. To prepare them is a difficult and time-consuming
task. In view of this, the psychometrist and the teachers who work with him
may grow impatient, short-cut some of the needed analysis, and produce tests
less reliable than their published counterparts.

If tests are to be devised by the school or system, they should receive the
same careful validation and standardization as the best published tests. Some
information about the process is presented in this text (pages ^>0-126) and
more detailed treatment is found in such books as Bean, Construction of Educational
and Personnel Tests, McGraw-Hill Book Co., 1953; Travers, Educational
Measurement, The Macmillan Co., 1955; and Lindquist (ed.), Educational
Measurement, ACE, 1951.

Cumulative Records. If the results of standardized testing are to be used 
for guidance and for instructional facilitation, each pupil's score on each test 
must be entered on a permanent record that can be accessible to persons who 
need the information. Any form is adequate as long as it is durable, readable, 
and provides for all the entries needed. Many schools and systems devise and 
print their own. Most frequently test results are recorded on the same card 
or in the same folder as is used for a permanent record of subject grades, attendance,
health, and other guidance data. Several cumulative records are
published, among them:

American Council on Education Cumulative Record Folders, Grades I-III,
IV-VI, VII-XII, XIII-XVI; 6¢ per copy at any level.

California Cumulative Guidance Record for Elementary Schools, Grades
I-VIII; basic folder $11.25 per 250 copies (A. Carlisle & Co.).

Cumulative Personnel Record, Grades VII-XII; in three forms, card, folder,
and envelope; 5¢ per copy for any form (National Association of Secondary
School Principals).

Uses of Testing Programs 

As we have observed, standardized testing programs are conducted in
order to further certain functions or purposes of the school. Perhaps their
foremost current uses are in connection with pupil guidance, ability grouping
(often called homogeneous or differentiated grouping), and general instructional
facilitation. We shall discuss each of these uses briefly and offer some
advice re their efficient performance.

GUIDANCE 

Guidance activities in schools are designed to help pupils make the best
possible adjustment to schooling. They may involve full or part-time counselors
for pupils with academic and personal problems, programing,
particularly in senior high schools, home room programs, group guidance
sessions, vocational exploration, and many other things. The scores that pupils
make on standardized tests obviously are important for several of these items.
If achievement test and mental test scores are known, a counselor is in a better
position to discuss low grades, tardiness, disorder, etc., with a pupil. Programing
and vocational guidance can hardly be accomplished effectively without
evidence of the pupil’s ability and interests; test scores together with teacher
reports constitute more reliable evidence than the latter alone.

Cumulative Records. It is well to observe a few general cautions in
using standardized test scores for guidance purposes. Scores should be entered
in cumulative records accurately and completely. Only derived scores
meaningful in themselves should be used for such entries. Raw scores are
meaningless without the distribution of raw scores for the class or grade, and
class percentiles may be misleading if the class was other than an average one.
As a rule of thumb, enter only percentiles or standard scores derived from the
test’s norms, age or grade placement in subjects, mental ages, and IQ’s.
Each entry must bear the date when the test was administered and the name
of the person who administered it.
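
The cautions above can be sketched as a simple record structure. This is only a minimal illustration, not a form prescribed by the text: the class name `TestEntry`, its field names, and the labels in `ALLOWED_KINDS` are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date

# Derived score types recommended for entry (the labels are hypothetical).
ALLOWED_KINDS = {"percentile", "standard_score", "age_placement",
                 "grade_placement", "mental_age", "iq"}

@dataclass(frozen=True)
class TestEntry:
    test_name: str
    kind: str              # must be one of ALLOWED_KINDS; raw scores are rejected
    value: float
    administered_on: date  # date the test was given
    administered_by: str   # person who gave the test

    def __post_init__(self):
        if self.kind not in ALLOWED_KINDS:
            raise ValueError(f"{self.kind!r} is not a derived score; do not enter it")
        if not self.administered_by.strip():
            raise ValueError("entry must name the person who administered the test")

# A hypothetical entry for one pupil.
entry = TestEntry("Reading test (hypothetical)", "percentile", 62.0,
                  date(1956, 10, 3), "J. Smith")
```

Rejecting raw scores at the moment of entry, rather than later, keeps the record usable by persons who do not have the class distribution at hand.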

Vocational Guidance. Vocational interest tests tend to be less reliable
than achievement tests. Moreover, the vocational interests of many adolescents
are subject to unpredictable variation. Consequently it is well to have a
broad base for any decision about vocational training. Teacher and parent
reports, interviews, and aptitude tests as well as an interest test should be
involved in vocational counseling. Interest tests are so available, so easy to
administer, and apparently yield such precise findings that counselors may be
tempted to rely on them alone.

Referral for Individual Testing. If an individual intelligence test is to be
administered to a pupil or if a projective test and diagnostic interview are to
be used, the matter ordinarily should be discussed in advance with the pupil’s
parents. Only rarely will parents object, but many will be angered or frightened
if they hear of it after the event.

Parents’ concern over the social stigma they think may attach to their
child being given a special test is exemplified in this incident. Certain
parents in a small town consented to an individual intelligence test for their
daughter, which had been ordered by the teacher on the grounds that she
thought the child was mentally retarded. The teacher called the mother and
told her this, by the way. Over an intervening week the mother and the father
were extremely agitated. The grandmother was called long distance and drove
at once (150 miles) to be with her daughter in this time of distress. The
examiner, one of the authors, was asked not to go to the school as he usually
did but directly to the pupil’s home. The mother then went to school and took
her child home on some other pretext so that no one would know that a
“psychologist” had looked at her. Fortunately, it was found that the girl was
bright enough but that a combination of an ill-trained and overly rigid teacher,
an aggressive older sibling, and parental criticism had produced a state of
acute anxiety in the girl, with accompanying postural and cognitive difficulties.
A change of school, better handling at home, and the amazing capacity for
recovery that children seem to have relieved the situation, but the family
probably will remember its trauma for a long time.

ABILITY GROUPING 

The need to accommodate school programs to the wide range of general
and special abilities possessed by pupils is an accepted fact. No longer is there
any public advocacy of the “Procrustean bed” approach to compulsory
education,¹ nor is it often found in practice. The most effective method of
adjusting programs to individual differences, however, has yet to be found, and
each of the many procedures now in use has weaknesses as well as strengths.
One of the more widely employed devices is that of “ability grouping.”
This approach, often called homogeneous grouping, differential grouping, or
sectioning, involves the division of the pupils in a grade into several classes,
each of which is more “homogeneous” with respect to one or several dimensions
than is the whole grade. Intelligence, reading, and/or special aptitude
for a subject are the usual dimensions on which division is based; three
groupings is the usual extent of division; and the device seldom is used below
the seventh grade.²

Ability grouping is a highly controversial issue and it is not within the
scope of this text to analyze it, let alone attempt to resolve it.³ We shall
restrict ourselves to a few general comments and then describe how
standardized testing may be used to determine the composition of sections.

Factors to Consider in Grouping. In general, segregating pupils by ability
prior to instruction has been found most effective for those of low
intelligence and least effective for those of high intelligence. Modification of
method, materials, and even objectives has been found essential, for unless
methods, etc., are appropriate it seems to make little difference whether there
is a heterogeneous or a homogeneous group. Finally, ability grouping on the
usual bases produces groups only slightly less heterogeneous than the original
one, and homogeneity with respect to one dimension does not insure
homogeneity with respect to another, perhaps equally important.

¹ The mythological Procrustes would fit wayfarers to his bed by stretching them if
they were too short or lopping off portions if they were too long.

² Grouping within a class for reading instruction is demonstrably effective and is
widely practiced in the primary grades.

³ Tiegs (26:262-285) presents a detailed analysis of ability grouping. Some other
important sources of information on this subject are Northby (17) and Otto (18);
National Society for the Study of Education, The Grouping of Pupils, 35th Yearbook,
Part I, 1936; and Adapting the Secondary School Program to the Needs of Youth, 52nd
Yearbook, Part I, 1953.



Obviously, if pupils who are most alike with respect to a given subject
are to be placed together for instruction, all the factors that bear importantly
on achievement in this subject need to be considered. A battery rather than
a single test approach thus is indicated. For the school subjects for which
ability grouping usually is practiced (English, mathematics, social studies,
and science) the use of an intelligence test, a reading test, and a test of
readiness or aptitude in the particular subject seems to be mandatory. If only
one test can be used for some reason, it is thought that this test should relate
specifically to the subject in question.

Frequency and Recency of Testing. It is known that children are erratic
in their school performance, that they change over a period of months and
certainly years, and that standardized test scores have rather large elements
of unreliability. Hence, two cautions should be observed. Several intelligence
test scores, several reading test scores, etc., should be used, not just one in
each category; and consideration should be given to previous marks in similar
subjects and to the reports of previous teachers as well as to test results.
Secondly, one or more of the tests used should have been administered recently,
in the last six months as a rule of thumb. The more than occasional practice
of using tests administered in Grade VII or VIII as the basis for differentiated
grouping in Grade X has serious limitations if not dangers.

Group for Specific Subjects Only. Assignment to any classification
should be for one subject only. Two or three categories of general ability
should not be established and pupils then assigned automatically to the
commensurate sections of each subject they take for which grouping is practiced.
Possible variation is too great between achievement in one subject and that
in another to permit this. For large groups, of course, the mean achievement
of slow pupils tends to be low in all subjects and, for brighter pupils, high
across the board. But for any individual no such exact uniformity may be
predicted.

Review and Verification of Assignments. Finally, there should be fre- 
quent review of assignment to given groups and opportunity for verification 
testing at any time if parents or teacher do not agree with the assignment made 
by a counselor or principal. These two processes will more than justify their 
expense in terms of parental and pupil acceptance of an ability grouping pro- 
gram and in terms of more efficient grouping. 

INSTRUCTIONAL FACILITATION 

Both the guidance and ability grouping applications of testing programs
are intended to facilitate instruction, if indirectly. In addition certain direct
applications are possible. The range and central tendency of scores on
intelligence and achievement tests for a given class are informative to any
teacher. They suggest what progress he may expect from a class and in some
cases at what level he needs to start. With respect to individuals, standardized
test scores foretell the poor readers, the slow thinkers, the able pupils, and
those for whom instruction must be enriched if it is to challenge them.

If test scores for a class are given in terms of national norms, knowledge 
of them enables a teacher to select materials more efficiently. For example, a
teacher may be able to order any of three textbooks, each designed for use in 
the same grade but differing as to reading difficulty. After inspecting a tabula- 
tion of reading scores for the new class, he may decide to use the easy one, 
the most advanced, or the middle one. Or he may wish to procure some of 
each. But in any event, his decision can be more rational with knowledge of 
the test scores than without it. 

Gauging Class Progress by Standard Test Norms. A further use of
standardized tests is to gauge the progress made by a given class in a given
period of time. If the subject(s) in question are those for which tests with
national norms exist, a teacher may pretest at the beginning of an instructional
interval, retest at the end, and see how much gain has been made. Since the norm
tables for the test(s) will show how much gain might normally be expected
in this period of time, the teacher may in effect have some basis for judging
the effectiveness of his instruction as compared with instruction generally. If
this sort of analysis is made, it is imperative that recognition be given to the
following. All learning in a period is not due just to the teacher and neither is
all failure to learn. A teacher’s objectives may differ from those upon which
the test norms are predicated and the norms will be inapplicable to some
extent. Unless a class has a range and mean of intelligence comparable to that
of the population upon which the test was standardized, direct comparisons
with age or grade norms are inappropriate.
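
The pretest and retest comparison can be sketched in a few lines. The figures below are hypothetical grade-placement scores and norm values invented for the illustration, and the function name `gain_versus_norm` is an assumption; no published norm table is reproduced.

```python
from statistics import mean

def gain_versus_norm(pre_scores, post_scores, norm_pre, norm_post):
    """Compare a class's mean gain with the gain shown in the norm table.

    pre_scores / post_scores: class scores on the same scale as the norms
    (for example, grade placements). norm_pre / norm_post: the norm values
    for the dates of the pretest and the retest.
    """
    class_gain = mean(post_scores) - mean(pre_scores)
    expected_gain = norm_post - norm_pre
    return class_gain, expected_gain, class_gain - expected_gain

# Hypothetical grade placements for five pupils at the start and end of a term.
pre = [4.8, 5.0, 5.3, 5.6, 5.8]
post = [5.6, 5.9, 6.0, 6.5, 6.5]
class_gain, expected_gain, difference = gain_versus_norm(pre, post, 5.2, 5.9)
print(round(class_gain, 2), round(expected_gain, 2), round(difference, 2))
```

A difference near zero suggests progress about equal to that of the norm group. The cautions stated above still govern any interpretation: the figure says nothing about why the gain occurred or whether the norms fit the class.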

Restrictions on Use of Standardized Tests. In connection with
instructional facilitation, two non-uses should be mentioned. Subject marks
and/or promotion should not be based on published standardized test scores.
The tests either will not relate to all objectives and aspects of a subject or, if
they do, there is a tendency for instruction to be directed toward passing the
test rather than toward permanent important changes in pupil knowledge and
behavior.

The other “don’t do it” is using standardized test scores in an administrative
evaluation of teacher effectiveness. For the teacher to do this himself is one
thing, and in some cases desirable, but for the principal or supervisor to do it
is another. The practice is condemned by all texts in school administration
and supervision of which the authors are aware.

Some General Tenets of an Efficient Testing Program 

Organization. If a school-wide or district testing program is to operate 
efficiently and is to furnish valid information for each of the several uses we 
have discussed, there needs to be a permanent and stable administrative 
organization for the program. Just what this organization should be, who will
do what and in what order, remains for each school or system to decide on the
basis of its own situation and what it wishes to accomplish with the tests.
Whatever the situation, though, one person should have clear responsibility
for over-all direction of the program. There should be provision for teacher
and principal participation in decisions that affect them and concerning which
they are qualified to judge. The psychometric staff should be large enough to 
handle all technical phases of the program. Finally, there needs to be pro- 
vision for periodic evaluation of the program and for revision as indicated. 

Role of Director. The director of the program and/or his assistants
should be required to serve as a consultant to teachers and administrators,
helping them understand and use the test results properly and even assisting
them in devising their own measuring devices. It is particularly ill-advised for
him to devote his time exclusively to the administration of tests and a
tabulation of scores. School psychologists and psychometrists who construe
their jobs this narrowly tend to lose contact with the instructional program
they are supposed to serve. Moreover, teachers think of the testing program as
being something apart from them and consequently may give less than adequate
attention to it.

An additional nontesting responsibility of the program director should be
in-service education for teachers. The purpose of the activity is to increase
the competence of teachers in administering and using the results of
standardized tests and, as well, to increase their skill in devising and using
their own evaluative procedures. Among the possible means of conducting such
in-service education are bulletins, seminars for teachers in a building, grade,
or subject, demonstrations of how to administer tests, and analysis of tests
submitted by teachers.

We stated earlier that the results of testing should be communicated
rapidly to all who can use them and we wish to reiterate this point. Unless the
results are to be applied to the problems of schooling, there seems to be little
value in administering the tests. The idea is not so much to give a standardized
test but to accomplish something by administering the test. When
teachers and principals participate in administration of the tests and there is
some semblance of an in-service education program, there tends to be less
“filing” of results.

Scope and Frequency of Testing. What tests are to be given and how
often is strictly a function of the local school. Practices vary widely and there
apparently are no grounds for asserting that any given practice is best. Los
Angeles schools in 1950 were using an intelligence test every other year from
Grade I on, an achievement battery at two-year intervals from Grade IV to
Grade IX, and in high school one comprehensive test of educational
development (11). In some other large districts, intelligence is not tested until
the third grade and at three-year intervals from then on. It is fairly common
minimum practice and probably wise to use an intelligence test in the primary
grades, achievement batteries in the intermediate and upper grades, and an
intelligence and reading test at the beginning of high school. The prognostic
value of vocational interest tests before the eleventh or twelfth grade is moot
but a number of districts administer them in Grade VIII, IX, or X. Group
personality tests rarely are used below the high school level.

A pupil’s achievement in a subject may change rapidly in a short period 
of time, so there seems to be no point of diminishing returns for frequent
achievement testing. Intelligence testing, on the other hand, is directed toward
an aspect of the pupil that changes slowly and regularly. Moreover, a pupil’s
status relative to his peers tends to remain fairly constant (see page 380). So 
there is for intelligence tests some maximum frequency of administration 
beyond which it may be fruitless to go. Research is lacking to establish just 
what this maximum is but we suggest that use of intelligence tests at two-year 
intervals is about as often as needed. More frequent intelligence testing for a 
special purpose (verification of a possibly spurious test result, change of 
schools, etc.) is, of course, necessary and worthwhile. 

Proper Emphasis. As a final tenet of efficiency for a testing program, the
program should be such as to stimulate better instruction and provide
important information. It should not be employed to check on teachers, to
control the curriculum, or to make unqualified comparisons between Teacher A
and Teacher B or School A and School B. Positive applications of the program
are likely to suffer if these negative ones are made. Moreover, effective
supervision, curriculum control, and curriculum evaluation involve far more than
standardized test results.

Summary 

School-wide testing programs usually are directed toward one or more of
the following: intelligence, reading, general achievement in basic subjects,
vocational interests, special aptitudes, and detecting extremities of
maladjustment. The basic measuring instruments employed are published
standardized tests that may be administered to many pupils at one time. The use
of individual tests is restricted in most schools to pupils who show an extreme
deviation either in intelligence or personality.

In selecting tests, critical reviews in the Mental Measurements Yearbook 
and in appropriate journals should be consulted in addition to test catalogues, 
and test specimens always should be examined before selection. Results of
testing are confidential data but they should be communicated quickly and 
fully to the professional persons who are to use them. The cost of a testing 
program is no more than a fraction of one per cent of the total cost of instruc- 
tion. 

School or system-wide testing programs are used in connection with a 
number of school functions or purposes. Among them are guidance, ability
grouping, and general instructional facilitation. In guidance, results of voca- 
tional interest tests should be considered suggestive only, and individual testing 
should be discussed with parents before use. Ability grouping should be based 




APPENDIX A 


GLOSSARY 

Ability tests. Tests that purport to measure an individual’s over-all facility in doing
given things. Often a distinction is attempted between that facility which
results from heredity and that which results from learning. In such cases,
ability tests are usually applied to the “native” aspect and achievement tests
to the learned aspect.

Absurdity items. Either statements or pictures that contain an element which is
incongruent, inconsistent, contradictory, invalid, etc. The testee is required
to pick out such absurd element. Used in intelligence tests and in
measurement of critical thinking and reasoning ability.

Example: The rabbit chased the dog around the yard.

Achievement tests. Tests that purport to measure an individual’s performance or
competence relative to a given subject, usually a subject taught in the schools.
Achievement tests are concerned with learned outcomes (generally knowledge
and/or understanding) rather than “native” capacity or ability to learn the
subject.

Admission tests. Tests or other measuring devices used to determine the eligibility
of students for admission to schools or special curriculums. Ordinarily the
province of colleges and the armed services.

Example: College Entrance Board Examinations.

Synonyms: Selection tests and screening tests.

Age equivalents. A method of expressing scores on standardized tests. The raw
score typical of pupils of different ages is determined and then any pupil’s
raw score may be converted to the age to which it pertains. Usually given in
years and months.

Example: Mental age = 12-6; reading age = 10-4.
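
The conversion described in this entry can be sketched as follows. This is a minimal illustration under stated assumptions: the norm table `NORMS` is hypothetical, linear interpolation between tabled ages is assumed, and the function names are invented for the example rather than taken from any published test manual.

```python
def format_age(months):
    """Write an age in months in years-months form, e.g. 150 -> '12-6'."""
    years, rem = divmod(int(round(months)), 12)
    return f"{years}-{rem}"

def age_equivalent(raw_score, norms):
    """Interpolate an age (in months) for a raw score.

    norms: list of (age_in_months, typical_raw_score) pairs, sorted by age,
    with typical scores strictly increasing with age.
    """
    for (age_lo, score_lo), (age_hi, score_hi) in zip(norms, norms[1:]):
        if score_lo <= raw_score <= score_hi:
            fraction = (raw_score - score_lo) / (score_hi - score_lo)
            return age_lo + fraction * (age_hi - age_lo)
    raise ValueError("raw score falls outside the norm table")

# Hypothetical norms: typical raw score at ages 10-0, 11-0, 12-0, and 13-0.
NORMS = [(120, 30), (132, 38), (144, 44), (156, 48)]
print(format_age(age_equivalent(46, NORMS)))  # 12-6
```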

Age norms. The typical scores made on a standardized test by pupils of different
ages. Usually expressed in tabular or graphic form, raw scores being related
to the correlative age in years and months.

Synonyms: Age tables, age charts, and age conversion tables.

Alternate forms. Standardized tests sometimes are issued in two forms containing
different items, which yield scores relating to the same dimensions and have
the same significance so far as age, percentile, standard score, etc. equivalents
are concerned. The purpose of this is to permit retesting without undue
practice effect. The less used form is called the alternate form.

Example: Stanford Revision of the Binet, Form M.

Synonyms: Equivalent form and comparable form.

Analogy items. A type of verbal or graphic item frequently used in tests of
intelligence to measure reasoning ability, specifically the facility for
generalization. They consist of two parts, one of which expresses a relationship
or comparison and the other of which requires that the same sort of relationship
or comparison be established among other elements.

Example: Pig is to bacon as steer is to: tree, fish, steak, shoes, corned beef.

Analysis of variance. The total variation of given measures for a group of
individuals may be accounted for by variations among the individuals with respect
to other factors. For example, the variation in IQ of a group of students may
be attributable to their variations in age, cultural background, socioeconomic
level, etc. The analysis of variance technique tests the statistical significance
of the effect of each factor. That is, it tells the extent to which chance
variation might have produced the same effect.

Anecdotal reports or ratings. Appraisals based on observations recorded in the
words of the observer. Sometimes special forms called anecdotal records are
used for the writing.

Appraisal. A general term meaning to determine and express the status of
anything. Often used synonymously with measurement but usually connotes less
precision in results. Sometimes used synonymously with evaluation but tends
to imply less judgment.

Appreciation tests. Instruments designed to measure attitudes and judgments
relative to given subjects, usually art, music, and literature.

Approximation. A concept basic to measurement, which asserts that any given
measure of a phenomenon can only “approximate” a “true” measure of its
actual status. Unreliable procedures produce very gross approximations and
more reliable ones produce the closer approximations. In standardized tests
the degree of approximation involved is indicated by the standard error of
score for the test in question.

Example: The IQ of 102 found by administration of a Binet test to a pupil
only approximates the pupil’s “theoretically true” IQ. This might be 100 10*'
or even 9^, etc. The Standard Error of Score of Binet IQ’s is 4 to for the
middle ranges of intelligence.

Aptitude tests. Tests or other measuring instruments (usually standardized and
commercially published) which purport to predict the ease with which an
individual will learn a given thing or the degree of success he is likely to have
in a given activity.

Example: Stenquist Mechanical Aptitude Tests, Seashore Measures of Musical
Talent.

Arbitrary origin. See assumed mean.

Arithmetic mean (X). The sum of a group of scores divided by the number of
scores in the group. Used as a representative score or a measure of central
tendency.

Synonyms: Average, mean.

Arrangement items. A type of guided response question or item which presents the
elements of a graphic, mechanical, or verbal pattern and requires that the
testee put them in proper array. Used fairly extensively in intelligence,
personality, and mechanical aptitude testing.

Example: Make a sentence out of these words:

me letter the to brought postman a



Assumed mean. The point in a frequency distribution arbitrarily chosen to be the
point from which the deviation of other measures is figured and from which
a correction for the true mean is computed. Usually the mid-point of a central
interval in a grouped distribution.

Synonyms: Guessed mean, arbitrary origin. 
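
As a minimal sketch of how an assumed mean is used, the following computes the mean of a grouped distribution by adding a correction (the average deviation of the measures from the assumed mean, signs retained) to the assumed mean. The distribution and the function name `mean_from_assumed` are hypothetical, chosen only for illustration.

```python
def mean_from_assumed(midpoints, frequencies, assumed_mean):
    """Mean of a grouped distribution by the assumed-mean method.

    Deviations are taken from the assumed mean; their frequency-weighted
    average is the correction added to the assumed mean.
    """
    n = sum(frequencies)
    sum_fd = sum(f * (m - assumed_mean) for m, f in zip(midpoints, frequencies))
    correction = sum_fd / n
    return assumed_mean + correction

# Hypothetical grouped distribution: interval mid-points and frequencies.
midpoints = [52, 57, 62, 67, 72]
frequencies = [2, 5, 8, 4, 1]
result = mean_from_assumed(midpoints, frequencies, assumed_mean=62)

# The direct computation gives the same value.
direct = sum(m * f for m, f in zip(midpoints, frequencies)) / sum(frequencies)
print(result, direct)  # 61.25 61.25
```

Any point may serve as the assumed mean; a central mid-point merely keeps the deviations small, which mattered for hand computation.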

Attitude tests. Free or guided response tests which purport to measure the feelings 
of individuals toward certain things. Usually gauged are the valence or direc- 
tion of feelings, the referents of feelings, and sometimes the intensity of 
feelings. 

Example: Tests of attitude toward war, school, minority groups, etc. 

Average. See arithmetic mean.

Average deviation. The sum of the absolute values of deviations (d) from the mean
in a frequency distribution divided by the number of measures in the
distribution. An infrequently used measure of dispersion for a group of measures.

$$\text{Average deviation} = \frac{\sum |d|}{N}$$
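
The definition can be checked with a short computation. The scores below are hypothetical and the function name is an assumption made for the example.

```python
def average_deviation(scores):
    """Sum of absolute deviations from the mean, divided by N."""
    n = len(scores)
    mean = sum(scores) / n
    return sum(abs(x - mean) for x in scores) / n

scores = [70, 75, 80, 85, 90]      # hypothetical test scores, mean 80
print(average_deviation(scores))   # 6.0
```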


Bar graph. Any graphic presentation that uses bars of various lengths to symbolize
differences in quantity, size, amount, etc.

Basal age. A convention of the Stanford Revision of the Binet Intelligence Scale.
The mental age designation of the last group of tests a child answers
correctly in entirety. Contrasted with top or maximum age, which is the mental
age designation of the last group of tests in which a child passes any test.

Battery. A lengthy standardized achievement test with separate and independent
parts for each of several school subjects or skills; or a group of tests published
by the same firm, applicable to the same grades, and standardized on the same
population, hence producing comparable percentile or standard scores.
Battery sometimes may mean any group of tests administered together for a given
purpose.

Bimodal. A distribution of measures, particularly test scores, with two foci of
central tendency rather than one. A superficial indication of bimodality is
the presence of two modes separated by scores or score intervals whose
frequency is appreciably less than that of the modes. Bimodality in a
distribution can be suggestive of several attributes of the group or of the test
or other measuring procedure in use. It often indicates that the group which is
bimodal involves two subgroups having important mean differences as to age,
mentality, reading ability, nationality, etc.

Biserial r. A correlation coefficient computed for two variables, one of which is
a continuous normal distribution and the other is dichotomized (split in half).
It is assumed that the underlying dichotomized variable is a continuous
normal distribution.

Example: To find the correlation between honesty or dishonesty (a
dichotomized variable) and intelligence test scores (a continuous variable) a
biserial r would be used.

Case. A generalized term for any entity in a group that is separately measured or 
counted. 

Example: A pupil in an eleventh-grade class, one guinea pig in a nutritional 
experiment involving 100 guinea pigs, one teaching situation in a study in- 
volving a comparison among many teaching situations. 

Synonyms: Individual, occurrence, entry. 



Ceiling. The highest degree of skill, knowledge, or any other dimension that a given
test can measure. A perfect score on a test is its ceiling and pupils who achieve
perfect scores may be said to have “hit the ceiling.” A valid test must have a
ceiling which exceeds that of the greatest degree of skill, etc., to be
encountered in the group to be measured.

Central tendency. In a distribution of scores or other measures, the point or
interval at which a plurality or majority of scores tends to cluster. Unless there
is such a clustering the distribution has no central tendency.

Chance factor. In any guided response, the correct option for any item may be
guessed as well as known. The extent to which chance may be the cause of a
correct response rather than knowledge or other rational determinant is a
function of the number of options and the number of those that are correct
for any item. Chance factor means that proportion of the maximum possible
score on a guided response test which can be attributed to chance, or the
odds against guessing expressed either as a ratio or as a percentage.

Example: In a true-false test the chance factor is one:two, or 50 per cent. In
a five-option multiple choice test the chance factor is one:five or 20 per cent.
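
The two examples follow directly from the number of options and the number of correct options. A minimal sketch, with a function name assumed for illustration:

```python
from fractions import Fraction

def chance_factor(options, correct=1):
    """Proportion of items expected to be answered correctly by blind guessing."""
    return Fraction(correct, options)

print(chance_factor(2))          # 1/2  (true-false)
print(float(chance_factor(5)))   # 0.2  (five-option multiple choice)
```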

Check lists. A device used in observation to direct attention to factors to be
observed and sometimes to provide space for recording ratings or comments
relative to them.

Chi-square (χ²). The sum of the squared discrepancies between observed (O) and
expected (E) frequencies, each divided by the expected frequency.

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

This statistic is generally useful in testing agreement with a priori frequencies,
and in particular for testing goodness of fit, group differences, change, and
independence.

Example: Chi-square may be used to check sex differences in passing or
failing a test item.
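
The example in this entry can be worked through as a sketch. The counts below are hypothetical, and the expected frequencies are computed from the row and column totals, which is the usual procedure when testing independence; the function name is an assumption.

```python
def chi_square(observed):
    """Chi-square for a table of observed frequencies (a list of rows).

    Expected frequencies come from the row and column totals, i.e. from
    the hypothesis that rows and columns are independent.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n
            total += (o - e) ** 2 / e
    return total

# Hypothetical counts for one test item: rows = boys, girls; columns = pass, fail.
table = [[30, 20],
         [20, 30]]
print(chi_square(table))  # 4.0
```

With one degree of freedom, a value of 4.0 exceeds 3.84, the conventional critical value at the 5 per cent level, so a difference this large would be judged unlikely to arise by chance alone.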

CA (chronological age). A child’s age expressed in years and months. Used in
reckoning the intelligence quotient and any other index involving a
comparison between skill or knowledge and age.

Example: CA 12-6 means the child’s age is twelve years, six months.

Classification. One of four basic forms of measurement (types of measurement
symbols). Involves the establishment of categories (classifications), the
designation of symbols for the categories, and then the assignment of the
symbols to phenomena according to the category to which they belong.

Examples: Blood typing, draft classifications, A, B, C, D, F as course marks.

Coefficient. A special name applied to certain ratios or proportionality constants.

Example: See coefficient of correlation.

Coefficient of correlation (r). Basically, it is a measure of the degree of closeness
with which the variation of one variable is associated with variation of another
variable. In this text it is shown that the square of the coefficient of
correlation is equal to the ratio of the explained variance to the total variance.

It is important to note that in computing a correlation coefficient between
two variables it is necessary to assume that the underlying distribution for
each variable is a continuous normal distribution.

Example: Coefficient of correlation between intelligence and school marks
is about .50.
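
The relation between r and the explained variance can be verified numerically. The sketch below assumes the product-moment form of r and a least-squares line for predicting one variable from the other; the paired scores and the function names are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Product-moment coefficient of correlation between paired measures."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def explained_ratio(x, y):
    """Explained variance / total variance for the least-squares line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    predicted = [my + slope * (a - mx) for a in x]
    explained = sum((p - my) ** 2 for p in predicted)
    total = sum((b - my) ** 2 for b in y)
    return explained / total

iq = [90, 95, 100, 105, 110, 120]     # hypothetical IQs
marks = [70, 78, 74, 85, 80, 88]      # hypothetical school marks
r = pearson_r(iq, marks)
print(round(r ** 2, 6) == round(explained_ratio(iq, marks), 6))  # True
```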



Completion items. Test questions which require that one or more missing parts of
a statement be filled in or completed. Classified as a guided response (or
objective) item but may involve some interpretation in scoring.

Synonym: Fill-in items.

Comprehensive tests. Any tests that cover a wide range of subject matter. The
term is used primarily at the college level to refer to achievement tests
covering entire academic areas, e.g., economics, biology, education, etc.

Confidence interval. An interval used to estimate the value of a true score or a
true mean in which a certain amount of confidence is placed.

Example: The chances are 95 out of 100 that a student’s true IQ is between
92 and 108. The interval 92-108 is called the 95 per cent confidence interval.
See level of confidence and level of significance. 
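
An interval of this kind can be sketched as the obtained score plus and minus a multiple of the standard error of score. The sketch assumes normally distributed errors of measurement, uses 1.96 standard errors for the 95 per cent level, and takes an obtained IQ of 100 with a standard error of 4 purely as illustrative values; the function name is an assumption.

```python
def confidence_interval(score, standard_error, z=1.96):
    """Interval of score plus and minus z standard errors (z = 1.96 for 95 per cent)."""
    half_width = z * standard_error
    return score - half_width, score + half_width

# Obtained IQ of 100 with an assumed standard error of score of 4.
low, high = confidence_interval(100, 4)
print(round(low), round(high))  # 92 108
```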

Construct. A verbal, mathematic, or graphic exposition offered as an explanation
for natural phenomena.

Example: Atomic theory, Freudian psychiatric theory.

Synonyms: Model, theory, hypothesis, explanation (at some times).

Contingency coefficient. A measure of the degree of association or correlation
between two variables whose variations are expressed in terms of categories.
Chi square (χ²) is used in its computation.

Example: A contingency coefficient would be used to indicate the correlation
between A's, B's, C's, and D's received in a mathematics course, and
A's, C's, and D's received in a physics course.

Controlled observation. Observation of behavior in which those under observation
are subjected to prearranged stimuli or in which the timing, focus, and
recording of observations are highly systematized.

Correction. Statistically, a numerical quantity added to or subtracted from an
estimate in order to obtain a true amount or at least a better estimate.

Correction for attenuation. A correction applied ordinarily to correlation
coefficients that have been reduced or attenuated by variable errors of
measurement. A correlation coefficient, then, which has been corrected for
attenuation, is an estimate of what the correlation would be if based upon
perfect and errorless measurements.

Correction for guessing. Any of several systematic ways of penalizing wrong
answers more than omitted answers in true-false and multiple-choice tests. Such
correction is based on the assumption that all students will try to guess many
answers unless deterred from guessing by the knowledge that they will be
penalized for it. It is intended to minimize the chance variation in scores.

Correction formulas. Formulas used in correcting raw scores of tests for guessing
(see page 117). The effectiveness of these formulas is questionable.
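
The formulas on the page cited are not reproduced in this glossary. The sketch below shows only the conventional "rights minus a fraction of wrongs" correction and may differ in detail from the formulas of the text; the function name and the sample counts are hypothetical.

```python
def corrected_score(right, wrong, options):
    """Conventional correction: rights minus wrongs divided by (options - 1).

    Omitted items are counted neither as right nor as wrong.
    """
    if options < 2:
        raise ValueError("an item needs at least two options")
    return right - wrong / (options - 1)

# Hypothetical pupil: 40 right, 10 wrong, the remainder omitted.
print(corrected_score(40, 10, 5))   # five-option multiple-choice items
print(corrected_score(40, 10, 2))   # true-false items
```

For true-false items the correction reduces to rights minus wrongs, the most severe penalty of this family.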

Correlation. The tendency of the variation of one variable to be accompanied by
the variation of another variable. A cause-effect relation is not necessarily
inferred between the two variables.

Covert dimensions. Conditions, elements, or properties attributed to unobservable
aspects of behavior, thinking, attitudes, drives, etc.

Example: Imagination, intensity of attitudes, and drive strength.

Roughly synonymous with Inferred dimensions. 

Criterion. In general, anything with which a measuring procedure is compared
in determining its validity. Specifically, a measuring procedure for a given
phenomenon for which exemplary validity is claimed or assumed and with
which other similar procedures are asked to have high positive correlations.
Example: The Stanford Revision of the Binet for intelligence, the Rorschach
for personality analysis. 

Synonym: Standard (in some contexts). 

CR (Critical ratio). The quotient of the difference between two statistics divided
by the standard error of this difference. The ratio usually is interpreted to
mean standard deviation units in a distribution of differences between the
statistics that could be produced by chance; hence it is an indication of the
significance of the difference.

Example: Difference between percentage of students expressing liking for
mathematics and those expressing dislike might be 30 per cent; the standard
error of this difference might be 12 per cent; the Critical Ratio then would be
30/12 = 2.5, and such a difference should be found on a chance basis only
about 1 out of 100 times.
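
The arithmetic of the example (30/12) can be sketched as follows. The chance probability here assumes a normal distribution of differences and counts differences in either direction; both function names are hypothetical.

```python
from statistics import NormalDist

def critical_ratio(difference, standard_error):
    """Difference between two statistics in units of its standard error."""
    return difference / standard_error

def chance_probability(cr):
    """Two-tailed probability of a ratio at least this large arising by chance."""
    return 2 * (1 - NormalDist().cdf(abs(cr)))

cr = critical_ratio(30, 12)
p = chance_probability(cr)
print(cr, round(p, 4))
```

Under these assumptions the probability is slightly over 1 in 100, in keeping with the rough figure given in the example.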

Culture-free tests. Intelligence tests have been criticized on the ground that famili- 
arity with the content and values of middle-class Anglo-American culture 
affects the scores of pupils. Culture-free tests are those that purport not to 
involve the content or values of a given culture. 

Cumulative frequency. A column in a conventional tabulation of scores or other 
measures that shows the frequency of scores up to and including any given 
interval. 

Cumulative record. Any form used by a succession of teachers and/or counselors
in recording data of importance to the academic progress and guidance of
pupils. Intelligence, reading, and achievement test scores usually are included
as well as subject grades, observations on deviate behavior, health information,
and comments on social adjustment.

Curve fitting. A name given to the statistical methods for determining the
equations of straight lines or curves that best fit the plotted points of a
graph or scatter diagram.

Curvilinear relationship. Applies to the situation in which the plotted points of a
graph of two variables approximate a curve rather than a straight line.
Example: Age as against height. 

Decile. Any one of nine percentile points in a distribution of scores that divide
the distribution into ten equal parts. The first decile is the 10th percentile,
the second decile is the 20th percentile, and so on.
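
A brief sketch of finding the nine decile points with the Python standard library follows. The scores are hypothetical, and it should be noted that several slightly different conventions exist for locating percentile points; the one used is the default of `statistics.quantiles`.

```python
from statistics import quantiles

# Hypothetical scores 1 through 99 (illustration only).
scores = list(range(1, 100))

# n=10 asks for the cut points dividing the distribution into ten equal parts.
deciles = quantiles(scores, n=10)
print(len(deciles), deciles[0], deciles[-1])
```

Nine points are returned, the first corresponding to the 10th percentile and the last to the 90th.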

Derived score. A test score that has been converted to an index of rank, scale 
position, or classification, as distinct from a raw score, which is the number 
of correct responses or the immediate numerical weight given the test. 
Example: Percentile rank, standard scores, mental age. 

Synonym: Converted score. 

Description. In this context, an informal type of measurement expression used to 
indicate the status of phenomena in which ordinary language is used. Descrip- 
tions may include scale, rank, and classification symbols. Associated with 
observation procedures and the appraisal of citizenship, study habits, social 
adjustment, etc. 

Deviation. In general, departure from a given condition. In particular, the nu- 
merical difference between a test score or other measure of an individual 
and a given point of reference, usually the mean of a group of test scores 
or other measures. 



Deviation IQ. An intelligence quotient determined by converting a raw test score
to a standard score on a scale that has 100 as the mean and approximately 
15 or 16 as its standard deviation. It is opposed to the more traditional ratio 
IQ, which compares mental age with chronological age. For children, devia- 
tion IQ’s and ratio IQ's have essentially the same significance and may be 
compared. 

Diagnostic tests. Tests designed to reveal points of strength and weakness in a
pupil’s skill or knowledge in a given subject. They are characterized by part
scores and/or a profile. Some intelligence and personality tests are considered
diagnostic in that they provide analytic scores.

Digit span. A convention of intelligence testing which refers to how long a series 
of numbers an individual can recall, having heard them spoken at second
intervals. Used as an index of memory. 

Dimensions. A collective term for properties, aspects, attributes, qualities, etc., of 
phenomena subject to measurement. Anything with respect to which we meas- 
ure a phenomenon. 

Example: Height of a pupil, accuracy of spelling, rate of reading. 

Synonym: Variable.

Discriminating power. The characteristic of a test item to distinguish between two 
or more groups of people, usually those who have great knowledge of a 
given subject and those who have little, or those who manifest a given "trait" 
and those who do not. 

Distractor. Any option in a multiple-choice or matching item that is incorrect.

Distribution. A table or graph showing the scores or other measures found for a
group, so arranged that the number who have a given score or who fall 
within a given range of scores is apparent. 

Synonyms: Frequency distribution, frequency tabulation, distribution table.

Educational age. When age norms are determined for a student in specific subjects
such as reading, arithmetic, social studies, science, etc., the average of these 
age norms is called his educational age. The index is construed to mean that 
a pupil’s level of academic achievement is comparable to that of the average 
pupil of the age given. 

Example: John’s educational age is twelve years, three months.

Synonymous with Achievement age, and may be converted into grade place- 
ment. 

Entrance examinations, tests. Tests or other measuring procedures used in
selective admission to schools and/or curriculums. Employed primarily by
colleges and professional schools.

Example: College entrance board examinations, graduate record examina- 
tions, American Council on Education psychological examination. 

Synonym: Selection, admission tests or examinations. 

Equivalent form. Either of two forms of a measuring instrument, particularly a
standardized test, parallel in content, difficulty, and norms, but different as 
to items. 

Synonyms: Alternate form, comparable form. 

Essay tests, items, questions. A type of free response test or item in which an ex- 
tended written answer is to be given to a discursive question. Commonly used 
in secondary schools and particularly in colleges. 

Synonym: “Blue book” examinations.



Evaluation. In this context defined to be the process of assigning symbols to
phenomena, which symbols signify the worth of the phenomena relative to some
scheme of value.

Evaluative standard. That with reference to which value is judged. In particular,
a scale, hierarchy, or series of gradations or levels describing the variations
of value in the status of a phenomenon and the value assigned to given
gradations or levels.

Examination. A general term, synonymous with test in most cases, which denotes
any procedure used for human measurement.

Examiner. A person who administers an examination or test.
Synonyms: Tester, test administrator, test proctor.

Extrapolation. Inferring measures outside the range of known measures on the
assumption that any observed pattern in known measures will continue.
Example: It is known that the mean score of eighth graders on a given test
is 80; ninth graders, 85; tenth graders, 90; and eleventh graders, 95. By
extrapolation, the mean for twelfth graders is 100.
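
The example can be worked by fitting a straight line to the known means and extending it one grade further. This is a minimal sketch that assumes the pattern is linear; the function name is hypothetical.

```python
from statistics import fmean

def linear_extrapolate(xs, ys, x_new):
    """Fit a least-squares straight line to the known points and extend it to x_new."""
    mx, my = fmean(xs), fmean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    return intercept + slope * x_new

grades = [8, 9, 10, 11]
mean_scores = [80, 85, 90, 95]          # the means given in the example
twelfth = linear_extrapolate(grades, mean_scores, 12)
print(twelfth)
```

The result rests wholly on the assumption that the gain of five points per grade continues, which the known measures cannot themselves guarantee.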

Factor. Anything treated as an entity which is known or presumed to be a
significant part or aspect of a complex phenomenon; for example, in intelligence,
the factor of memory; in handwriting, the factor of slant.

Synonyms: Dimension, variable. In “factor analysis” factor has special meaning
and refers to that which underlies or causes positive correlation between
scores on different tests or parts of one test.

Factor analysis. A term applied to methods based on the intercorrelations of
several tests that attempt to account for the interrelationships among the tests
in terms of a few underlying factors. Factor analysis has aided in the
identification of basic components of intelligence, aptitudes, and personality
through a study of the interrelationships of already prepared tests in these
areas.

Fill-in items. A type of guided response item or question in which sentences are
presented having one or more missing words. The testee is required to fill in
the absent term(s).

Example: Tom Sawyer floated down the ________ River in the book
called ________.

Floor. The least degree of knowledge, skill, intelligence, etc., that a test is capable
of measuring. If any pupil has a raw score of zero (no questions answered
correctly), his knowledge, etc., is below the “floor” of the test and hence
the test cannot be used to measure him. A valid test must have a floor
considerably below that of the least knowing, skilled, or intelligent person to
whom it is to be applied.

Foil. A wrong or incorrect option in a multiple-choice item.

Example: For this question the underlined responses are foils:

The President of the United States in 1918 was _Harding_, _Hughes_, Wilson,
_Roosevelt_, _Taft_.
Synonyms: Distractor, incorrect option.

Form of measurement. A type of number or other symbol used to characterize a
phenomenon with respect to some dimension. Forms in use are scale numbers,
rank numbers, classification numbers or symbols, and descriptive words
or phrases.

Example: 12'6" is a scale form, 12th percentile is a rank form, B student
is a classification form, and “makes more errors in punctuation than in
spelling” is a descriptive form.

Free response tests and items. Tests and items or questions thereof in which a
person may respond in his own words or with his own self-devised actions
and in which the possibilities of response are many and the length of the
response more than a word, phrase, or gesture.

Example: Explain the construction of a pleated skirt.

Synonyms: Essay question and, to some extent, “subjective” tests and/or
items.

Frequency. Refers in statistics to the number of times a score is repeated or to
the number of scores appearing in a given interval.

Frequency distribution. Consists of a sequence of score intervals and the frequency
or number of scores falling in each interval.

Example: See page 133.

Frequency polygon. A line graph of a particular frequency distribution. The
horizontal axis contains a scale for the score intervals and on the vertical axis
is a scale of frequency per interval. The frequency polygon is then a series
of straight lines connecting points at the middle of each interval representing
the total frequency in each interval.

Example: See page 137. 

Grades. Has two common meanings: 1. The 12 to 15 basic and sequential yearly
subdivisions of elementary and secondary schools; 2. evaluative symbols
assigned to pupil work, tests, or achievement to indicate its value. In the latter
sense, grades are synonymous with marks. A, B, C, D, and F are the most
frequently used letter grades or marks.

Grade equivalent. An index of achievement in which a pupil’s ability or skill as
measured by a standardized test is denoted by naming the grade for which
the ability is typical.

Synonym: Grade placement.

Graph or graphic. Any visual representation of amount or quantity as related to
other variables. Common types are histograms, frequency polygons, bar
graphs, line graphs, and pie graphs.

Graphic rating scale. A line with numbers or a series of numbers that symbolize
the range of variation in some behavioral dimension under or by which are
written brief descriptions of several different degrees of the dimension.

Group evaluation. Any discursive activity in which a group analyzes and evaluates
its activity, progress, or accomplishment. Used particularly in unit
instruction and by committees or groups that pay attention to the tenets of
“group dynamics.”

Group tests. Tests administrable to a group of persons at one time. Contrasted with
individual tests, administrable to only one person at a time.

Guidance folder. A folder used as a repository for standard tests, health records,
teacher testimony, and other documents bearing on the ability, achievement,
and adjustment of a school pupil. The folder may itself be a printed form for
the recording of subject grades, test results, and other data.

Guided-response tests and items. A collective term for tests and test questions in
which the possibility of response is strictly limited and the significance of any
response often is predetermined.

Example: True-false tests and items, multiple-choice tests, completion tests, 
etc. 

Roughly synonymous with Objective tests and items. 

Histogram. A vertical bar graph of a frequency distribution. At each interval on
the horizontal axis a vertical bar is erected whose scaled height represents the
total frequency in the interval.

Example: See page 136. 

Index. In behavioral measurement, any number or other symbol that signifies or 
indicates status with respect to some dimension. 

Example: A given IQ is an index of intelligence, high school marks are in- 
dexes of college aptitude, a given coefficient of correlation between halves 
of a test is an index of the test’s reliability. 

Individual tests. Tests administrable to only one person at a time. Contrasted with
group tests, see above.

Inferred dimension. In this text a property or quality of a phenomenon not itself
observable but imputed or inferred to a phenomenon.

Example: Knowledge, reasoning ability, introversion, etc.

Usually synonymous with: Covert and intangible dimensions. 

Ink blots. The vague and meaningless blots employed in a famous personality test, 
the Rorschach. 

Instrument. Any device used to facilitate measurement. In education it is usually
a piece of paper bearing printing or typing to which an individual is required
to respond.

IQ (intelligence quotient). The most common index of a person’s intelligence or
brightness relative to what is normal for his age. IQ of 100 means average
intelligence and higher IQ’s indicate more, and lower IQ’s less intelligence.
For children it may be interpreted as a ratio between mental age and
chronological age. For adults it may mean only their relative position among a
general population of adults.

Intelligence tests, testing. Instruments and procedures, and their use, employed to
measure intelligence, mentality, or general ability.

Synonyms: Mental tests, tests of general ability.

Intercorrelation. A term applied to each of the correlations among a group of
tests. Intercorrelations are usually displayed in tables showing the correlation
of each test with each of the other tests. Intercorrelations are then used to
show the extent of interrelationships among a certain group of tests.

Interpolation. Estimating a value between two known points, linear relationship
assumed.

Example: If a pupil’s test score is 94 and a table shows that a score of 90
equals a mental age of ten years, and 102 indicates a mental age of ten years, 
six months, the pupil’s estimated mental age would be ten years and two 
months. 
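
The arithmetic of the example may be sketched as below, with mental ages expressed in months (ten years equals 120 months). The function name is hypothetical, and a straight-line relationship between the two tabled points is assumed, as the definition requires.

```python
def interpolate(x, x0, y0, x1, y1):
    """Linear interpolation between the known points (x0, y0) and (x1, y1)."""
    return y0 + (x - x0) * (y1 - y0) / (x1 - x0)

# Score 90 -> mental age 120 months; score 102 -> 126 months (from the example).
months = interpolate(94, 90, 120, 102, 126)
years, extra_months = divmod(months, 12)
print(years, extra_months)
```

The score of 94 lies one third of the way from 90 to 102, so the estimate lies one third of the way through the six-month span.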

Inventories. A name for a type of test that attempts to classify a pupil’s interests, 
attitudes, values, etc. 

Example: Kuder Preference Vocational Interest Inventory, Bernreuter
Personality Inventory.

Interval, i. An arbitrary portion of a range or continuum of scores, usually desig- 
nated by a lower boundary score and an upper boundary score. 

Example: 23-27 is an interval in a score range of 16 to 75. 



Item. A question or direction in a test that requires an answer from the person
examined.

Synonyms: Question, test question, test item.

Item analysis. The process by which the relative difficulty and discriminating ability
are determined for items in a test.

Item difficulty. The percentage of individuals who get the item correct.

Example: An item of 50 per cent difficulty would be answered correctly by
half of those who responded to the item. An item of 85 per cent difficulty
would be answered correctly by 85 per cent of those responding to it and
hence would be a less difficult item.
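
Computed directly from the definition, item difficulty is simply the per cent of respondents who answered correctly. In the sketch below the function name and the twenty responses are hypothetical.

```python
def item_difficulty(responses):
    """Per cent of respondents answering the item correctly.

    `responses` holds True for a correct answer and False for an incorrect one.
    """
    return 100 * sum(responses) / len(responses)

# Hypothetical item answered correctly by 17 of 20 pupils.
responses = [True] * 17 + [False] * 3
print(item_difficulty(responses))
```

Note that under this definition a larger figure denotes an easier item.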

Key. The correct answers for a test or a basis for interpreting answers. May be
simply a test that has the correct answers marked, a stencil, or a coded sheet
for an electrical scoring machine.

Labeling items. A type of guided response test item which asks the individual to
name the designated parts of a drawing, map, or chart.

Level of confidence. A statistical term used to indicate the degree of confidence we
may place upon an interval estimate. Level of confidence is usually indicated
by a probability percentage, such as being 95 per cent sure or being 99 per
cent sure.

Example: On the basis of a sample it is estimated that the true mean of a
population lies somewhere between 64 and 67 and this interval was computed
with a 95 per cent level of confidence. The chances, then, of the true mean
being contained in the interval 64-67 are 95 out of 100. See level of
significance and confidence interval.

Level of significance. A statistical term used to indicate the amount of confidence
in whether or not the difference between two means, two percentages, or other
comparable measures is statistically significant (not due to chance).

Example: It is reported that the difference in mean scores of boys and girls
is statistically significant at the .01 level. This means that the chances are only
1 out of 100 that the difference is not significant. See confidence interval
and level of confidence.

Linear relationship. If the plotted points of a graph of two variables closely
approximate a straight line, then a linear relationship is said to exist between
the two variables.

Machine scoring. The process of scoring a test by means of an electronic scanning
device. It necessitates use of a special response sheet where answers are made
by marking spaces or numbers with a pencil whose lead is an electrical
conductor.

Man-to-man rating scales. A type of rating scale for recording appraisals of
behavior in which the person rated is compared with a verbal description of a
person whom he is thought to resemble.

Marks. The letter symbols (in some cases numbers) used to evaluate pupil
achievement in school subjects and on tests and products related to the subjects.
The most common marks are A, B, C, D, F.
Often synonymous with Grades.

Matching items. A type of guided response item that involves two columns of
associated items. The response consists of pairing the items. Used to test
knowledge of wars-dates, authors-books, animals-species, relationships, etc.

Matrix. A tabulation of friendship or popularity preferences among a group.



Mean. See arithmetic mean.

Measurement. The assignment of a symbol, often a number, so as to characterize
the status of a phenomenon relative to some dimension, usually by indicating
its scale position, its rank, or its classification in this dimension.
Example: Finding that a road is ten miles long, that a pupil ranks third in
spelling, that a pupil’s interest is mechanical.

Roughly synonymous with Appraisal 

Measurement error. Error in a statistical sense is a matter of chance factors and
not of mistakes made. Consequently, measurement error is the difference
between the true value and observed value of a measurement due only to
whatever chance factors may be present and not due to a mistake of the
measurer. The standard error is used to estimate the amount of such
measurement errors.

Median. The score or point that divides a distribution of scores into two equal
groups, with half of the scores falling above and half below. Used as a
representative score or a measure of central tendency.
Synonyms: Middle score, 50th percentile.

MA (mental age). The age group whose average mental development is most like
that of the child in question.

Mental tests, testing. Tests or testing procedures that purport to measure
differences in intelligence or general mental ability.

Example: Wechsler Intelligence Scale for Children, California Test of Mental
Maturity.

Synonyms: Intelligence tests, general ability tests, etc.

Mid-point of interval. The halfway point between the actual boundaries of an
interval. It is found by adding half the size of the interval to the actual lower
boundary or subtracting from the actual upper boundary.
Example: The mid-point of the interval 3-6 is 4½.
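
The rule may be sketched as follows, using the interval 23-27 given under Interval in this glossary. It is assumed here that scores are whole numbers, so that the actual boundaries extend half a unit beyond the stated scores; the function name is hypothetical.

```python
def midpoint(lower_score, upper_score, unit=1):
    """Mid-point of an interval, measured from its actual lower boundary."""
    lower_boundary = lower_score - unit / 2      # e.g. 22.5 for a stated lower score of 23
    upper_boundary = upper_score + unit / 2      # e.g. 27.5 for a stated upper score of 27
    size = upper_boundary - lower_boundary       # size of the interval
    return lower_boundary + size / 2

print(midpoint(23, 27))
```

Subtracting half the size of the interval from the actual upper boundary gives the same point.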

Mode. The score or measure that occurs most frequently in a distribution.

Model. A system of verbal, mathematical, and/or graphic symbols, purporting to
explain any phenomenon and that may be the basis for testing,
experimentation, prediction, instruction, etc.

Example: Euclidean geometry, quantum theory of light, general and special
factors of intelligence, Darwinian theory of evolution.
Synonyms: Theory, construct, hypothesis, etc.

Multiple-choice items. Test items or questions, widely used in standardized tests,
which permit the testee to choose an answer from among several options only
one of which is correct or each of which has a special significance.
Example: The author of The Adventures of Tom Sawyer is

1. Grey, 2. Cooper, 3. Twain, 4. London, 5. Melville.
Synonym: Multiple-option items.

Multiple correlation. A multiple correlation is computed when more than two
variables are involved. Suppose we have three variables, x, y, and z, and
suppose variables y and z are combined to estimate variable x. The correlation
between x and the combination of y and z which best estimates x is called a
multiple correlation.

Example: High school marks and college entrance exam scores may be
combined to predict first-year college grades. Hence a correlation between
first-year college grades and the combination of high school marks and entrance
exam scores that predicts college grades is called a multiple correlation.

N. In statistics, N stands for the number of measures in a sample or in a
distribution.

Negative discrimination. Students may be separated into two groups, those who
made high scores on a given test and those who made low scores. The
proportion of either group who were correct on a given item on the test may
then be determined. If a greater proportion of the group making low scores
got the item correct than those who got high scores, the item is said to show
negative discrimination. This is considered evidence of invalidity in the item.
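
The comparison described may be sketched as a simple difference of proportions. This difference index is only one of several indexes in use and is not necessarily the one employed in the text; the function name and the responses of the two groups are hypothetical.

```python
def discrimination(high_group, low_group):
    """Proportion correct among high scorers minus proportion correct among low scorers.

    Each group is a list holding True for a correct answer to the item, False otherwise.
    A negative result indicates negative discrimination.
    """
    p_high = sum(high_group) / len(high_group)
    p_low = sum(low_group) / len(low_group)
    return p_high - p_low

# Hypothetical item: 2 of 4 high scorers correct, 3 of 4 low scorers correct.
index = discrimination([True, True, False, False], [True, True, True, False])
print(index)
```

An item giving a negative index, as this one does, would be a candidate for revision or removal.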

Nonverbal tests and items. Tests and items so developed that variation in responses
to them is not importantly affected by differences among people in verbal
skill.

Example: Form board tests, geometric and mechanical puzzles, picture recall
and discrimination items, manipulatory tests and items.

Synonym: Performance tests and items. 

Normal curve. Refers to the normal probability curve, which is a theoretical
distribution of probability described precisely by a mathematical equation. The
curve has a distinct bell-shaped appearance. Some distributions of observed
variables closely approximate the shape of a normal probability curve. When
this occurs, then certain probability statements may be made in connection
with these variables.

Example: If a distribution of reading scores which approximates a normal
curve has a mean of 62 and standard deviation of 10, then we could say that
the chances are only 16 out of a hundred for a student to get a score of 72
or higher on the test.
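
The figure of 16 out of a hundred can be checked from the normal curve itself. The sketch below uses the mean of 62 and standard deviation of 10 from the example and assumes the scores are exactly normal, which observed scores only approximate.

```python
from statistics import NormalDist

# Normal curve with the mean and standard deviation given in the example.
reading = NormalDist(mu=62, sigma=10)

# Proportion of the curve at or above a score of 72 (one standard deviation above the mean).
chance_72_or_higher = 1 - reading.cdf(72)
print(round(chance_72_or_higher, 2))
```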

Normalizing. The process of readjusting test scores in a distribution to make them
conform to the normal distribution, the assumption being that the true
distribution of the trait being measured would be normal if there were a
satisfactory measure of it.

Norming. The process of establishing equivalence between raw scores on a
standardized test and age, grade placement, percentile rank, and standard scores
in the population on which the test is being standardized.

Synonyms: Standardizing, standardization.

Norms. Statistics based upon a standardization group or a group that is purported
to be representative of a much larger population. These norms are thus
assumed to be representative of large groups such as all fifth-grade children or
all twelve-year-olds. Grade, age, percentile, and standard score norms are the
most common form.

Objective tests. Measuring instruments that are amenable to mechanical, electronic,
or other scoring method little dependent on the interpretation or judgment
of the scorer.

Example: True-false and multiple-choice tests.

Synonym: Guided response tests.

Observation. The most widely used and usually most crude method of behavioral
measurement. Involves direct perception of the dimensions of the phenom- 
enon being measured. With appropriate attentional, perceptual, and record- 
ing aids, observation can be a highly reliable procedure. 

Example: Watching a pupil study to determine how effective are his study
habits. 



Observation schedules. Sheets of paper with captions, directions, etc., used to direct
observation and often as a means of recording observations.

Opinionnaire. Any instrument designed to elicit the written opinion of respondents
on given questions. Subsumed by the more general term questionnaire, which
deals with questions of fact as well as opinion.

Options. The response variants from among which an individual may select his
answer to a test question.

Example: Columbus sailed to the West Indies in
(a) 1470, (b) 1492, (c) 1592, (d) 1607.
The four dates listed are options.
Synonyms: Choices, alternatives.

Overt dimensions. Aspects or properties of behavior subject to measurement
(dimensions) that are directly observable (overt).

Example: Rate of talking, slant of handwriting, strength of grip.

Synonyms: Observable dimensions, objects of direct measurement, tangible
properties.

Parent-teacher conferences. A discussion between parent and teacher regarding the
progress of a pupil. Used with increasing frequency in addition to or as
substitute for a report card.

Partial correlation. The correlation between two variables where the influence of
another variable or other variables has been eliminated.

Percentile. A standard index of relative position or rank in a distribution of scores
which consists of that percentage of scores lying below any given score or
point.

Example: The 65th percentile is that score or point below which lie 65 per
cent of the scores.
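
Read literally, the definition leads to a very short computation: count the scores lying below the given point and express the count as a per cent of all scores. The sketch below follows that reading; published tests often refine it (for instance, in the handling of tied scores), and the function name and the twenty scores are hypothetical.

```python
def percentile_rank(scores, point):
    """Per cent of the scores lying below the given score or point."""
    below = sum(1 for s in scores if s < point)
    return 100 * below / len(scores)

# Hypothetical distribution of twenty scores, 1 through 20.
scores = list(range(1, 21))
print(percentile_rank(scores, 14))
```

Thirteen of the twenty scores lie below 14, so that point stands at the 65th percentile of this distribution.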

Performance test. Any test or other procedure of measurement that is not
importantly influenced by an individual’s verbal skill and that purports to
measure some nonverbal dimension of intelligence or achievement.
Example: Form boards, recall of pictures, spatial perception tests, etc.
Somewhat synonymous with nonverbal tests.

Personality tests. Any tests that purport to measure the differences among
individuals as to the nature and dynamics of their goals, fears, identifications,
needs, drives, adjustment mechanisms, etc.

Pie graph. A graphic procedure for displaying proportions of a total amount by
marking off sectors of a circle that are proportionately equal to the relative
size of the amounts in question.

Population. Used in an abstract sense in measurement and statistics to indicate any
given group of things (pupils, schools, teachers, experimental animals, etc.).
Often means the totality of any group in question as opposed to a sample
of this group.

Synonymous with universe.

Positive discrimination. In achievement testing, the characteristic of a test item
(usually guided response) to be answered correctly more often by students
making high scores on a test in question than by those making low scores. Test
items must show positive discrimination if they are valid for the dimension to
which they relate. See Negative discrimination.

Power tests. Specifically, measuring instruments whose items show increasing
difficulty when arranged in serial order. In general, measuring instruments that
are not timed and are designed to measure the extensiveness or depth of an 
individual’s knowledge or skill. 

Practice effect. It is known that a performance of any task affects a rcperformance 
of that task, usually in the direction of improvement. “Practice effect” is the 
term for the significance of such reperformance when the same test is ad- 
ministered to the same individual more than once. 

Pretest. Any measuring instrument (usually an achievement test) administered 
prior to a period of instruction, an experiment, or other circumstance of 
interest. As a rule, pretests are used to establish the initial status of pupils so 
that the amount of their learning may be judged from the results of a later 
retest. 

Probability. As applied to behavioral measurement, the concept that any measure 
or statistic is somewhat subject to chance variation. Hence it deviates from 
some theoretically “true” measure. Such deviation is commonly called error 
and its probable extent can be determined and stated mathematically. 

Probable error. A statistic derived from the standard error used to establish an 
interval estimate for which there is 50 per cent confidence. For normal
distributions, the probable error is equal to .6745 times the standard error.
Example: If the mean of a sample is 47 and its probable error is 4, then the 
chances are 50 out of 100 that the true mean lies between 43 and 51. 
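
The arithmetic in this entry can be sketched in a few lines. The function names below are illustrative assumptions; only the factor .6745 and the figures 47 and 4 come from the entry itself.

```python
PROBABLE_ERROR_FACTOR = 0.6745  # from the entry: PE = .6745 x standard error


def probable_error(standard_error):
    """Convert a standard error to a probable error (normal distribution assumed)."""
    return PROBABLE_ERROR_FACTOR * standard_error


def fifty_percent_interval(estimate, pe):
    """Interval within which the true value lies with 50 per cent confidence."""
    return (estimate - pe, estimate + pe)


# The example from the entry: mean 47, probable error 4.
interval = fifty_percent_interval(47, 4)
print(interval)  # (43, 51)
```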

Product analysis. A basic procedure of educational evaluation in which the things 
that pupils produce in the course of instruction are appraised in appropriate 
ways and given scores or ratings. 

Example: Compositions, outlines, drawings, wood work, etc. 

Product-moment formula. A widely used formula for determining the correlation
coefficient. Let $z_x$ be the standard score for variable x and let $z_y$ be the
standard score for variable y. If the pairs of $z_x$'s and $z_y$'s for each individual
are multiplied, then added for all individuals and divided by the number of
cases, the result is the product-moment formula for the correlation coefficient.

Correlation coefficient (r) = $\frac{\sum z_x z_y}{N}$

Thus, according to the product-moment formula, the correlation coefficient
is the mean of the set of products of standard scores for the two variables.
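
A minimal sketch of the product-moment computation, r as the mean of the products of paired standard scores, follows. The function names and the sample data are assumptions for illustration; the standard deviation divides by N, in keeping with this glossary's definition of standard deviation.

```python
from math import sqrt


def z_scores(values):
    """Standard scores, using a standard deviation that divides by N."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]


def product_moment_r(xs, ys):
    """Mean of the products of paired standard scores."""
    zx, zy = z_scores(xs), z_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)


# Hypothetical paired scores; y rises in exact proportion to x.
x_scores = [1, 2, 3, 4, 5]
y_scores = [2, 4, 6, 8, 10]
r = product_moment_r(x_scores, y_scores)
print(round(r, 6))  # 1.0
```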

Profile. An analytic graphic presentation of a pupil’s scores on a test battery, scores 
on parts of a given test, marks in several school subjects, ratings on several 
personality variables, etc. 

Prognosis and prognostic. The matter of predicting the future accomplishment 
of individuals on the basis of systematic measurement, usually guided response 
testing. 

Synonym: Prediction and predictive. 

Prognostic tests. Any measuring instruments that serve effectively in predicting the 
future accomplishment of individuals. 

Example: Tests of musical aptitude, intelligence tests, college entrance
examinations.

Synonyms: Predictive tests, aptitude tests. 

Projective techniques, methods, tests. Free-response procedures for behavioral
measurement that present an ambiguous, vague, or unstructured stimulation
to the testee. His response (oral, written, or graphic) then is presumed to
reveal or project his personality. Used extensively in diagnosing mental
disorders.

Example: The Rorschach Ink Blot Test, the Murray Pictures.

Psychometrist. A person who engages professionally in psychometry.

Psychometry. The body of concepts, procedures, principles, devices, etc.,
connected with the measurement of behavioral or psychological phenomena.

Quartile. Any one of three percentile points that divide a distribution of scores
into four equal groups or quarters.

Q (quartile deviation). A common index of variation or dispersion in a distribution
of scores. It is the distance between the upper and lower quartiles divided by
two. It is sometimes called the semi-interquartile range.

$Q = \frac{Q_3 - Q_1}{2}$

Example: If, in a distribution of scores having a range of from 32 to 98,
$Q_3 = 71$ and $Q_1 = 53$, $Q = 9$ or $\frac{71 - 53}{2}$.
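
The quartile deviation, half the distance between the upper and lower quartiles, can be sketched as follows. The function name and the list of scores are illustrative assumptions. Note that `statistics.quantiles` uses its own interpolation convention, so quartiles it finds for a small set of scores may differ slightly from those found by hand.

```python
from statistics import quantiles


def quartile_deviation(q1, q3):
    """Half the distance between the upper and lower quartiles."""
    return (q3 - q1) / 2


# The example from the entry: Q3 = 71 and Q1 = 53.
q = quartile_deviation(53, 71)
print(q)  # 9.0

# Finding the quartiles from raw scores (hypothetical data).
scores = [32, 45, 53, 60, 64, 71, 80, 98]
q1, median, q3 = quantiles(scores, n=4)
q_from_scores = quartile_deviation(q1, q3)
```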


$Q_1$ (first quartile). The lower quartile or 25th percentile, which marks off the lower
25 per cent of a distribution of scores.

$Q_3$ (third quartile). The upper quartile or 75th percentile, which marks off the
upper 25 per cent of a distribution of scores.

Questionnaire. Any of a number of paper forms used to get information from a
person.

Quintile. Any one of four percentile points that divide a distribution of scores into
five equal groups; every twentieth percentile. The first quintile is the 20th
percentile, the second quintile is the 40th percentile, and so on.

Random sample. A sample selected from a larger population in such a manner
that every individual or object in the population had an equal chance of being
chosen in the sample.

Range. The difference between the highest and lowest scores in a given distribution
of scores.

Example: If the highest score in a distribution were 74 and the lowest were
30, the range would be R = 74 - 30 = 44.

Ranking, rank numbers. The process of ordering the constituents of a group in
terms of some dimension. Rank numbers indicate the relative position of the
constituents.

Example: Finding and stating the rank of students as to skill in tennis.

Rate tests. Tests which measure the speed at which individuals can perform
certain activities, as reading, typing, shorthand.

Synonym: Speed tests.

Rating. A direct appraisal of a dimension in terms of some descriptive scale or
verbal classification scheme.

Example: Rating the citizenship of pupils from poor to excellent; rating
school plants as unsatisfactory, satisfactory, excellent.

Rating scale. A verbal or pictorial device used to facilitate and to record ratings
as described above.

Example: See page 56.

Raw score. The first quantitative untreated result obtained in scoring a test.

Example: 85 right on a spelling test.





Readiness tests. Any of a variety of measuring instruments and procedures used 
to see if pupils are “ready” to attempt to learn a given subject, most commonly
reading. 

Reading age (RA). A type of “norm” score derived from standardized tests used 
to indicate a pupil’s reading ability in terms of age equivalents. Reading age 
means the age group whose average performance is most like the performance 
of the child in question. RA may be established only in terms of a given 
standardized test. 

Example: Mary’s reading age is 12-3, or twelve years and three months. 

See: Norms. 

Reading grade. A type of “norm” score derived from standardized tests that states 
a pupil’s ability to read in terms of grade equivalents. Reading grade means 
the school grade whose average performance is most like that of the pupil in 
question. By interpolation, the reading grade may be fractional. As with 
reading age, reading grade refers only to a given standardized test. 

Example: John’s reading grade is 6-7, meaning like the average child who
has been in the sixth grade for seven months. 

See: Norms. 

Reliability. An essential characteristic of a measuring instrument having to do with 
consistency and dependability. An instrument is reliable to the extent to 
which it yields the same results when reapplied to the same dimension of the 
same phenomenon whose status has not changed. Reliability is indicated by 
high coefficients of reliability and by low standard errors of score.

Report cards. A means of communicating pupils’ progress and achievement of
educational objectives.

See: Appendix C.

Representative sample. A sample of a population so drawn as to reflect accurately
all essential and pertinent characteristics of the population. 

See: Random sample and sampling. 

Representative score. Any single score used to represent a group of scores.
Ordinarily this may be the arithmetic mean, median, or mode.

Retest. A test readministered at the end of a period of instruction or other activity,
the results of which are to be compared with an earlier administration of the
test.


Rho (ρ). The rank-difference measure of correlation. Individuals are assigned ranks
with respect to each of two variables, and for each individual the difference
(d) in rank is determined. These differences are squared and summed for all
cases and substitution is made in the following formula.

$\rho = 1 - \frac{6 \sum d^2}{N(N^2 - 1)}$
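
The rank-difference computation, rho = 1 - 6Σd²/(N(N² - 1)), can be sketched as below. The function name and the sample ranks are assumptions for illustration, and the sketch assumes ranks without ties.

```python
def rank_difference_rho(ranks_x, ranks_y):
    """Rank-difference correlation from two lists of ranks (no ties assumed)."""
    n = len(ranks_x)
    # Square the difference in rank for each individual and sum over all cases.
    sum_d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks_x, ranks_y))
    return 1 - (6 * sum_d_squared) / (n * (n ** 2 - 1))


# Hypothetical ranks of five pupils on two variables.
same_order = rank_difference_rho([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])
reverse_order = rank_difference_rho([1, 2, 3, 4, 5], [5, 4, 3, 2, 1])
print(same_order, reverse_order)  # 1.0 -1.0
```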


Sampling. Selecting a portion of a population on which to base estimates of the 
population. 

Example: Political polls, test standardization samples, ore sampling.

See: Random sampling, representative sampling. 

Sampling error. Errors due to the chance factors in random sampling. Sampling
error is usually estimated statistically by the standard error.

Example: The difference between the mean of a sample of a population and
the mean of the entire population.

See: Measurement error, standard error, confidence interval. 




Scaling, scale numbers. Measurement in terms of defined precise units that
represent given amounts or degrees of some dimension. Scale numbers indicate
the number of units and hence the amount or degree of this dimension.
Scale numbers must refer to a fixed point or reference, usually a zero.
Example: Measuring height with a ruler, speed with a watch, and intelligence
in terms of deviation IQ’s.

Scattergram. A two-dimensional chart used as a basis for computing a correlation
coefficient for two variables, x and y. The chart consists of a horizontal scale
for the x-variable and a vertical scale for the y-variable. Pairs of x and y
values are plotted as points or tallies on the chart. The scatter of points gives
a visual picture of the relationship between the x and y variables.

Example: See page 176.

Scoring. The process of assigning a score (usually a number or letter symbol) to
a test or pupil product. For a test, this is often done by comparing a paper
with the key, marking the questions answered correctly, and adding up the
total.

Scoring stencil. A mechanical device for scoring tests. 

See: Page 117. 

Screening tests. Procedures used by colleges, armed forces, and other agencies to 
eliminate persons who fail to exhibit certain minimum qualifications. 

See: Admissions tests 

Self-evaluation. Any of many concepts and procedures concerned with an individual
observing and judging his own performance, achievement, or adjustment.

Short answer items. Types of guided response questions to which a pupil may
respond with one or a few words, the significance of which is exactly pre- 
determined. 

Sigma, σ. In statistics, the small Greek letter σ (sigma) is a symbol for the standard
deviation of a theoretical distribution. More particularly, and with subscripts,
it symbolizes the theoretical distribution of means, correlation coefficients,
etc., and also stands for standard error.

Significance of differences of scores, etc. The degree to which chance factors
probably did not produce the difference observed.

See: Confidence interval, probability, level of confidence, measurement error.

Skewed. The characteristic of some distributions of measures to cluster not at the
middle of the range but toward either extreme.
See: Page 149.

Sociometry. The concepts and procedures having to do with the measurement of 
personal or feeling relationships among members of groups. 

Split-half reliability. A coefficient of reliability based upon splitting a test into sup- 
posedly equivalent halves and correlating the scores on one half of the test 
with paired scores on the other half of the test 
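
One way to obtain a split-half coefficient is sketched below. The odd-even split of the items, the function names, and the response data are all assumptions made for illustration; the entry says only that the halves should be supposedly equivalent.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product-moment correlation between two paired lists of scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cross / (spread_x * spread_y)


def split_half_reliability(responses):
    """Correlate each pupil's score on odd-numbered items with the even-numbered items.

    `responses` holds one list of item scores (1 right, 0 wrong) per pupil.
    """
    odd_half = [sum(items[0::2]) for items in responses]   # items 1, 3, 5, ...
    even_half = [sum(items[1::2]) for items in responses]  # items 2, 4, 6, ...
    return pearson_r(odd_half, even_half)


# Hypothetical responses of five pupils to a six-item test.
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
reliability = split_half_reliability(responses)
print(round(reliability, 2))
```

The coefficient describes a test only half as long as the original, which is one reason it is usually read with some caution.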

Standard. See: Evaluative standard. 

Standard deviation. An index of variation in a group of measures. It represents the
square root of the mean of the squared deviations of the individual measures,
or

$s = \sqrt{\frac{\sum d^2}{N}}$

where d is the deviation of a score from the mean.
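
The definition translates directly into a short computation. The function name and the scores are illustrative assumptions; the division is by N, as the definition states.

```python
from math import sqrt


def standard_deviation(scores):
    """Square root of the mean of the squared deviations from the mean."""
    n = len(scores)
    mean = sum(scores) / n
    squared_deviations = [(score - mean) ** 2 for score in scores]
    return sqrt(sum(squared_deviations) / n)


# Hypothetical scores with a mean of 5.
s = standard_deviation([2, 4, 4, 4, 5, 5, 7, 9])
print(s)  # 2.0
```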




Standard error. The estimated standard deviation of an imaginary normal distribution
of repeated samplings. The standard error is used as the basis for computing
confidence intervals or interval estimates of true values.
Example: A student's test score is 78 and the standard error is 4. If the student
were retested infinitely, the test scores would form a normal distribution
and the standard deviation of this imaginary curve is called the standard error.
Consequently, the chances are 68 out of 100 that the student's true score lies
in the interval 74-82, or the 95 per cent confidence interval would be 70-86
for the true score.
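
The two intervals in this example follow from taking one and two standard errors on either side of the obtained score. The sketch below assumes, as the example does, that two standard errors approximate the 95 per cent interval; the function name is illustrative.

```python
def confidence_interval(score, standard_error, number_of_errors):
    """Interval extending a given number of standard errors either side of a score."""
    margin = number_of_errors * standard_error
    return (score - margin, score + margin)


# The example from the entry: score 78, standard error 4.
interval_68 = confidence_interval(78, 4, 1)  # about 68 chances in 100
interval_95 = confidence_interval(78, 4, 2)  # about 95 chances in 100
print(interval_68, interval_95)  # (74, 82) (70, 86)
```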

Standard score (z score). A general term referring to any of a number of scores
that indicate how many standard deviations a measurement is above or below
the mean. It is found by determining the difference between the raw score
(x) and the mean ($\bar{x}$) and dividing by the standard deviation (s).

$z = \frac{x - \bar{x}}{s}$

In many standardized tests, the z scores are converted to positive and whole
numbers by conversion formulas. For example, a z score of -1.5, which
means that the student in question is one and one-half deviations below the
mean, might be converted to a test “standard score” of 35 by the following
conversion formula:

$(z \text{ score of } -1.5)(10) + 50 = 35$

Synonyms: σ score, z score.
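
Both steps, finding the z score and converting it to a positive whole-number scale, can be sketched as follows. The function names, the default multiplier and constant, and the raw-score figures are assumptions chosen to reproduce the conversion in the entry.

```python
def z_score(raw_score, mean, standard_deviation):
    """Number of standard deviations a raw score lies above or below the mean."""
    return (raw_score - mean) / standard_deviation


def converted_score(z, multiplier=10, constant=50):
    """Linear conversion of a z score, as in (z)(10) + 50."""
    return multiplier * z + constant


# Hypothetical raw score of 70 on a test with mean 85 and standard deviation 10.
z = z_score(70, 85, 10)
converted = converted_score(z)
print(z, converted)  # -1.5 35.0
```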

Standardization population. The individuals who were administered a given test
prior to publication and on whose scores are based the norm scores for the
test.

Standardized tests. Tests, usually published, which have been preadministered to
a population of known characteristics and yield scores in terms of this
population. This population is selected so as to be a representative sample of the
total population for which the test is designed.

Stanine. Any one of nine intervals on a scale of standard scores. The stanine
(abbreviation for standard-nine) scale spans the normal curve in nine intervals
of size equal to one half of a standard score. The stanine intervals have values
from 1 to 9 and the middle interval extends from standard score $-\frac{1}{4}$ to
$+\frac{1}{4}$.

See: Appendix D.
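
A sketch of assigning stanines from z scores follows. It assumes intervals half a standard score wide, centered on the mean so that stanine 5 runs from -.25 to +.25, and it treats stanines 1 and 9 as open-ended; the function name and the handling of scores falling exactly on a boundary are illustrative choices.

```python
from bisect import bisect_right

# Boundaries between adjacent stanines, in z-score units (assumed from the entry).
STANINE_BOUNDS = [-1.75, -1.25, -0.75, -0.25, 0.25, 0.75, 1.25, 1.75]


def stanine(z):
    """Stanine (1 to 9) for a z score; stanines 1 and 9 are treated as open-ended."""
    # Count how many boundaries lie at or below z, then shift to a 1-9 scale.
    return bisect_right(STANINE_BOUNDS, z) + 1


print([stanine(z) for z in (-3.0, -1.5, 0.0, 0.5, 3.0)])  # [1, 2, 5, 6, 9]
```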

Statistic(s). Any derived quantity obtained from a set of raw scores or measures.
Examples: N, mean, standard deviation, median, mode, quartile deviation,
correlation coefficient.

Statistical significance. A difference is said to be statistically significant when there
is only a slight probability that the difference was due to chance variation.
Formerly, the term was used only when the size of a difference exceeded its
standard error by threefold.
See: Confidence interval, confidence level.

Strip key. A simple type of key for scoring a guided response test in which correct
answers are shown on a strip of paper rather than on a whole test. The strip
usually is the answer column of the test.





Subjective tests. Tests scored without a key and on the basis of the examiner’s
interpretation and judgment. Usually they involve long written responses. 
Synonym: Essay test. 

Survey tests. Measuring instruments or procedures designed to measure broad areas 
of knowledge or ability in terms of one or a few general dimensions. Opposed 
to diagnostic tests, analytic tests, and profile tests. 

T-score. A special conversion of z scores or standard scores in order to eliminate
plus and minus signs and decimals. T score conversions are particularly
applied to z scores that are medians of normal curve intervals representing
categorical ratings like good, fair, excellent, and so on. The median z score is
multiplied by 10 and added to 50.

T score = $10z + 50$.

Tests. Any of a great number of procedures in which individuals respond to com- 
mon stimulations and react in comparable ways and which yield a measure 
of the individuals with respect to one or more dimensions. 

Synonyms: Examinations, instruments, quizzes. 

Test plan. In standardized test construction, the rationale of the test, including
what is to be measured, how, and how expressed. 

Test protocols. Directions for interpreting, classifying, and/or rating responses
to individually administered intelligence and personality tests. 

Tetrachoric r. A special correlation coefficient that is computed for two variables
with the variation of both expressed as dichotomies.

Example: The correlation between good or poor students and their passing
or failing a test question. The assumption is made that the underlying
distribution of both variables is a normal distribution.

Thought questions. A loose term meaning all sorts of test items that involve rea- 
soning as well as memory and that usually require extended answers. 
Example: What is the relationship between Shakespeare’s plays and
Elizabethan politics?

Timed tests. Tests for which students have an allotted time to respond.

Top. The highest degree of any dimension that a given test can measure. Usually 
a valid test needs to have more top than any individual to whom it is to be 
administered. This is indicated by an absence of perfect scores. 

Synonym: Test ceiling. 

Traits. The elemental attributes of personality according to certain theories and
test plans. Comparable to “factors” in intelligence. 

Example: Honesty, sympathy, etc. Aggressiveness, dominance, etc. 

True-false items. A form of guided response item in which pupils indicate whether 
they think statements are correct or incorrect.

True score. The theoretical measure that a valid and reliable test would produce 
if no chance variables affected it. Statistically, it is the mean of the
distribution of an infinite number of measurements made of the same thing.

See: Measurement error, standard error.

Universe. See: Population. 

Untimed tests. Tests which all students are expected and allowed to finish.

See: Timed tests. 

Validation. The process of establishing on the basis of empirical data the validity
of a test, usually a standardized one, by comparing its results with one or
more criterions. Typically involves, as a minimum, item analysis, correlation
of results with other test scores, analysis of distributions of scores, and
determination of reliability.

Validity. The essential characteristic of a measuring procedure or instrument
actually to measure the phenomenon and dimensions that it purports to
measure.

Variable. In measurement, a term for any significant property, aspect, condition,
etc., of a phenomenon whose variance affects variance in the phenomenon.
For example, variation in the school achievement of pupils is affected by the
variables of intelligence, age, motivation, instruction, etc.

See: Dimension.

Vocational interest tests. Tests designed to determine the specific vocations or
occupational areas in which a person may be interested.

Z score. See: Standard score.



APPENDIX B 


ANNOTATED BIBLIOGRAPHY OF SELECTED 
PUBLISHED TESTS 


Explanation

1. Tests listed are considered to be of average or better validity and usefulness and
to represent the variety of available tests. Detailed analyses of these and all other tests
are to be found in Buros' Mental Measurements Yearbooks, in reviews in periodicals,
in publishers' catalogues, and in manuals for the tests.

2. An effort has been made to cover all age and grade levels and most subjects of
importance in the schools.

3. The inclusion of a test does not indicate its unqualified endorsement, nor should
an omission be considered to reflect on the test omitted.

4. Except where there are clear and specific factors bearing on a test's validity,
validity is not discussed and may be presumed to be consistent with prevalent standards
for standardized tests.

5. Reliability coefficients are stated where possible and, unless stated otherwise, are
derived from split-half coefficients of correlation. In some cases data bearing on
reliability are vague but there still is an indication of reliability equivalent to that of
comparable tests. Here the statement "satisfactory" has been used in lieu of a coefficient.
Where no data are available this is stated.

6. The comments column contains an analysis of subtests or dimensions covered by
the test, types of score, and any particularly important features. Detailed reviews are
beyond the scope of this bibliography.

7. Cost of tests is to be found in publishers' catalogues. Cost is mentioned only where
it is unusually high (as for certain individual test kits). Separate answer sheets for
either manual or machine scoring are available for most tests listed. Catalogues will
specify.

8. This list should be used only to indicate possibilities and to order specimen tests
and manuals, not to adopt tests without further inspection and evaluation.





General Achievement Batteries



reasoning or critical thinking.

It is intended to extend the tests downward into the middle elementary
grades. They are designed ultimately to replace the ACE as an





(Every Pupil Tests continued) Advanced

Battery Grade







to minimize effect of content. Carefully constructed; difficulty of
validation on universal criteria; effect of mental set on scores unknown.




Form n91- 
Business 
Fundamentals 
and General 
Information 



Bookkeeping



SRA Dictation Skills, 1947. Two parts: Speed, Accuracy. Science Research
Associates.



English

(See also General Achievement Batteries)




Richardson) III. Sentence and paragraph structure



Foreign Language






Nondiagnostic, but best present means of measuring children's
handwriting.



Health 

Kilander World Book High School 40 min t8 Percentile norms Tests knowledge




Food Selection and Preparation, Form C, 1948. High School. 60 min.



(State High continued) 






(Holzinger continued) Correlations with academic subject
grades around ^0 with nonaca-


Subtests are figure dividing, reverse drawings, pattern synthesis,
movement sequence, manikin, and paper folding.



Furnishes MA’s and Ratio IQ’s.




reasoning, number, word fluency;
several ability quotients (like IQ)
and percentile norms.



(SRA continued) specific mental factors. Some
reviewers question the validity of the



(WISC) yet as well accepted as the Binet for
school age children.

Individually administered by qualified
examiner only.

$22.00 per set.



Title    Publisher    Age or Grade Range    Time    Reliability




which includes arithmetic, intuitive geometry, simple algebra, and
numerical trigonometry. One third of items on lower grades arithmetic.
Emphasis on mechanics and memory.






Personality-Character-Interests







More used in sociological studies of group attitudes and status than for measuring individuals.










S.R.A. Reading Record, 1947. Science Research Associates. Grades 8-13. 40 min., timed. Satisfactory if rigidly timed. Percentile norms by grade for total and subtests as follows:
1. Rate.
2. Comprehension.
3. Paragraph meaning.
4. Reading a directory.
5. Interpretation of maps, charts.




Adequate coverage of basic biological information; contains some application and interpretation of principles.






Read General Science Test. World Book. Grades 9-12. ‘>0 min., timed. Split half .88; alternate form. Percentile norms. Excellent statistical procedures.







PUBLISHERS' ADDRESSES 


Acorn Publishing Company, Rockville Centre, N.Y.
Bureau of Educational Measurements, Kansas State Teachers College of Emporia, Emporia, Kansas
Bureau of Educational Research and Service, State University of Iowa, Iowa City, Iowa
Bureau of Publications, Teachers College, Columbia University, N.Y.
California Test Bureau, 5916 Hollywood Blvd., Los Angeles, California
Educational Test Bureau, 720 Washington Ave. S.E., Minneapolis, Minnesota
Educational Testing Service, Princeton, N.J.
Grune & Stratton, Inc., 381 Fourth Ave., New York, N.Y.
Harvard Press, 44 Francis Ave., Cambridge, Massachusetts
Houghton Mifflin Company, 2 Park Street, Boston, Massachusetts
Joint Committee of United Business Association and National Office Managers Association, 112 West Chelten Ave., Philadelphia, Pennsylvania
Psychological Corporation, 522 Fifth Ave., New York, N.Y.
Science Research Associates, 57 West Grand Ave., Chicago, Illinois
Stanford University Press, Stanford, California
State High School Testing Service for Indiana, Purdue University, Lafayette, Indiana
World Book Company, 313 Park Hill Ave., Yonkers, N.Y.



APPENDIX C 


SAMPLE REPORT CARDS

[Facsimile report card: Elementary]

[Facsimile report card: Kindergarten]

[Facsimile: Parent-Teacher Conference Blank]

[Facsimile report cards: Junior High]

[Facsimile report cards: Senior High]




APPENDIX D 


TABLE 32

Percentage of Area Under the Normal Curve Between Mean Ordinate
and Ordinate at Given z Score

                     z Score to Second Decimal Place
z Score    .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
  0.0    00.00  00.40  00.80  01.20  01.60  01.99  02.39  02.79  03.19  03.59
  0.1    03.98  04.38  04.78  05.17  05.57  05.96  06.36  06.75  07.14  07.53
  0.2    07.93  08.32  08.71  09.10  09.48  09.87  10.26  10.64  11.03  11.41
  0.3    11.79  12.17  12.55  12.93  13.31  13.68  14.06  14.43  14.80  15.17
  0.4    15.54  15.91  16.28  16.64  17.00  17.36  17.72  18.08  18.44  18.79
  0.5    19.15  19.50  19.85  20.19  20.54  20.88  21.23  21.57  21.90  22.24
  0.6    22.57  22.91  23.24  23.57  23.89  24.22  24.54  24.86  25.17  25.49
  0.7    25.80  26.11  26.42  26.73  27.04  27.34  27.64  27.94  28.23  28.52
  0.8    28.81  29.10  29.39  29.67  29.95  30.23  30.51  30.78  31.06  31.33
  0.9    31.59  31.86  32.12  32.38  32.64  32.90  33.15  33.40  33.65  33.89
  1.0    34.13  34.38  34.61  34.85  35.08  35.31  35.54  35.77  35.99  36.21
  1.1    36.43  36.65  36.86  37.08  37.29  37.49  37.70  37.90  38.10  38.30
  1.2    38.49  38.69  38.88  39.07  39.25  39.44  39.62  39.80  39.97  40.15
  1.3    40.32  40.49  40.66  40.82  40.99  41.15  41.31  41.47  41.62  41.77
  1.4    41.92  42.07  42.22  42.36  42.51  42.65  42.79  42.92  43.06  43.19
  1.5    43.32  43.45  43.57  43.70  43.83  43.94  44.06  44.18  44.29  44.41
  1.6    44.52  44.63  44.74  44.84  44.95  45.05  45.15  45.25  45.35  45.45
  1.7    45.54  45.64  45.73  45.82  45.91  45.99  46.08  46.16  46.25  46.32
  1.8    46.41  46.49  46.56  46.64  46.71  46.78  46.86  46.93  46.99  47.06
  1.9    47.13  47.19  47.26  47.32  47.38  47.44  47.50  47.56  47.61  47.67
  2.0    47.72  47.78  47.83  47.88  47.93  47.98  48.03  48.08  48.12  48.17
  2.1    48.21  48.26  48.30  48.34  48.38  48.42  48.46  48.50  48.54  48.57
  2.2    48.61  48.64  48.68  48.71  48.75  48.78  48.81  48.84  48.87  48.90
  2.3    48.93  48.96  48.98  49.01  49.04  49.06  49.09  49.11  49.13  49.16
  2.4    49.18  49.20  49.22  49.25  49.27  49.29  49.31  49.32  49.34  49.36
  2.5    49.38  49.40  49.41  49.43  49.45  49.46  49.48  49.49  49.51  49.52
  2.6    49.53  49.55  49.56  49.57  49.59  49.60  49.61  49.62  49.63  49.64
  2.7    49.65  49.66  49.67  49.68  49.69  49.70  49.71  49.72  49.73  49.74
  2.8    49.74  49.75  49.76  49.77  49.77  49.78  49.79  49.79  49.80  49.81
  2.9    49.81  49.82  49.82  49.83  49.84  49.84  49.85  49.85  49.86  49.86

  3.0    49.87
  3.5    49.98
  4.0    49.997
  5.0    49.99997
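
As a rough cross-check on the tabled values, the percentage of area between the mean and a given z score can be computed from the error function. This is only a sketch: the function name `pct_area_mean_to_z` is illustrative and does not come from the text, and rounding in the last digit may differ slightly from the printed table.

```python
import math

def pct_area_mean_to_z(z: float) -> float:
    """Percentage of normal-curve area between the mean and z.

    Uses the identity: area = 0.5 * erf(|z| / sqrt(2)).
    The curve is symmetric, so the sign of z is ignored.
    """
    return 100.0 * 0.5 * math.erf(abs(z) / math.sqrt(2.0))

if __name__ == "__main__":
    # Print a few entries for comparison with the table.
    for z in (0.50, 1.00, 1.96, 3.00):
        print(f"z = {z:4.2f}   area = {pct_area_mean_to_z(z):6.2f} per cent")
```

Doubling the result gives the percentage of cases lying within z standard deviations on either side of the mean.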





TABLE 33

Several Types of Score as Related to a Normal Probability Distribution ¹

¹ The Psychological Corporation, Test Service Bulletin No. 47 (September 1954), p. 6.
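
The chart itself is not reproduced here. As a hedged illustration of the kind of relationship it depicts, the sketch below converts a z score into a percentile rank and into a linearly rescaled standard score under the normal curve. The function names and the default mean of 50 and standard deviation of 10 are assumptions chosen for illustration, not values taken from the chart.

```python
import math

def percentile_rank(z: float) -> float:
    """Percentage of a normal distribution falling below z."""
    return 100.0 * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def scaled_score(z: float, mean: float = 50.0, sd: float = 10.0) -> float:
    """Re-express z on a scale with the given mean and standard deviation.

    The defaults (50, 10) are an assumed, illustrative scale.
    """
    return mean + sd * z

if __name__ == "__main__":
    for z in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(f"z = {z:+.1f}   percentile rank = {percentile_rank(z):5.1f}"
              f"   scaled score = {scaled_score(z):5.1f}")
```

Equal steps in z correspond to equal steps on any linear scale but to unequal steps in percentile rank, which is why the two kinds of score are not interchangeable.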



























INDEX 


Ability grouping, 431-432
Achievement tests, 463-475, 480-482, 486-492
    (see also specific school subjects)
Accuracy, as a dimension of performance, 330
Activity subjects (see Performance activity areas)
Administration of tests, 118-121
Age norms, 159
Algebra (see Mathematics)
    example items, 321
Anecdotal forms and procedures, 53-54
Arithmetic (see Mathematics)
    example items, 320
Arithmetic mean, 144-149
    comparison with median, 149-150
    outline for computing, 148
Arrangement of elements items
    in general, 85, 86-88
    use and construction, 112
Art
    dimensions, 333
    evaluation in, 356
    example of rating chart, 345
Attitudes and interests
    dimensions, 390-391, 408-409
    evaluative standards, 390-391, 414-415
    example of an opinion scale, 285
    forms of measurement symbols, 390-391
    measuring procedures, 390-391, 409-414
    self-reporting devices, 409-411
    tests of opinion, 411-414
Attributes (see Dimensions)
Average (see Arithmetic mean)
Average deviation (see Mean deviation)
Ayres Handwriting Scale, 252, 472

Behavior rating scales, 56-57, 416-418
Beliefs, 388, 389, 391
Bias
    cultural and verbal bias in intelligence tests, 374-376
    defined, 45
    as a factor in observational evaluations, 61
Business education
    dimensions, 335
    evaluation in, 356-357
    examples of a plan for evaluating a typing performance, 349-350

Central tendency, measures of, 139-150
Character and citizenship (see Personality)
    defined, 387
    dimensions, 393, 415-416
    evaluative practices, 419
    evaluative standards, 418-419
    forms of measurement symbols, 416
    measuring procedures, 416-418
Characteristics (see Dimensions)
Check lists, 52-53
    for performance tests, 342-343, 346
Citizenship (see Character and citizenship)
    as a factor in subject grading, 275
Classification symbols, 9-11
Coherency, as a dimension of performance, 331-332
Completion or fill-in items, 110-111
Components, as dimensions of achievement, 29
Composition, 238-250
    composition tests, 240-243
    dimensions, 238-239
    evaluative standards, 248-250
    forms of measurement symbols, 239-240
    marking compositions, 240
    mechanics of grammar tests, 243-244
    procedures for measuring, 240-248
    spelling tests, 244-246
    tests of rhetoric or effectiveness, 246-248
    vocabulary tests, 246
Concomitant variation, 173
Constructs, 26-28
Correlation, 172-181
    coefficient of, 174-181
    as concomitant variation, 173

    interpretation of, 180-181
    as a ratio, 178
Cost of testing, 428
Criterion
    defined, 45
    groups, 405
    in validating tests, 124, 186
Cultural bias in intelligence tests,
Cumulative percentage curve, 156-157
Cumulative records, 429-430

Description as a form of measurement, 11-12
Deviation, 144-145
Diagnosis of disabilities
    achievement in general, 425-426
    arithmetic, 314-315
    handwriting, 252
    reading, 234-235
    social studies, 277-278
    speech, 352-354
    spelling, 246
Difficulty of test items, 100-101, 115
Dimensions
    art, 333
    attitudes and interests, 408-409
    business education, 335
    citizenship, 393, 415-416
    common to educational phenomena, 29-32
    composition, 238-239
    criteria for measurability, 20-24
    defined, 20
    handwriting, 251
    home economics, 336-337
    industrial arts, 335
    inferred, 26-28
    intelligence, 362-366
    language arts, 220
    list of educational dimensions, 17-18
    literature, 253-255
    mathematics, 309-311
    music, 334-335
    performance-activity subjects in general, 330-332
    personality and character, 388-394
    physical education, 337-338
    principles in selection of, 28-29
    reading, 228-230
    reading readiness, 222-223
    science, 296-299
    social studies, 267-269
    speech, 236
Directions, in testing, 116, 119, 120
Director of a testing program, 434
Discrimination, as a dimension of performance, 331
Discrimination in test items, 98-100
Distributions, types of, 138
Drawings, as personality projections, 396, 399
Driving performance check list, 346

Economy of effort, as a dimension of performance, 331
Educational phenomena
    common dimensions of, 29-32
    for which measurement may be desired, 17-18
    preparing for measurement, 17-33
Efficiency of a measuring procedure, 43-44
English, 238-250, 255-262
    analytic evaluation of achievement, 261-262
Error, as a dimension of achievement, 30, 330-331
Essay examinations, questions, 65
    (see Free-response procedures)
Evaluation (see Evaluative standards)
    defined, 2, 190
    as a direct outcome of observations, 50, 51
    as distinguished from measurement, 190-192
    evaluative standards, 192
    examples of, 198-205
    steps in, 196-198
    symbols of, 195-196
    (see also specific school subjects)
Evaluative standards (see Evaluation)
    art, 356
    attitudes and interests, 390-391, 414-415
    business subjects, 356-357
    citizenship, 393, 418-419
    composition, 248-250
    course objectives as standards, 272-273
    criteria for, 193-194
    distribution standards, 194
    emotional standards, 19"^
    example for history, 279-280
    example for reading, 199, 202
    handwriting, 252-253
    industrial arts, 357
    intelligence, 378
    language arts, 221-222
    levels of understanding as, 204
    literature, 259-260
    mathematics, 313-319
    music, 356
    nature and source of, 192-195
    percentage standards, 194-195
    performance activity subjects, 354-355
    performance scales as standards, 203-205
    personality and character, 403-407
    physical education, 357-358
    reading, 235
    reading readiness, 225, 227



    science, 299-301
    used in selecting dimensions for measurement, 29
    social studies, 272-274
    speech, 237
    teacher opinion as a standard, 274
    textbooks as standards, 273
Experience as a factor in reading readiness, 224-225

Factorial scoring, 76-81
    factor counting, 77-79
    factor rating, 76-77
    factor weighting, 79-81
Factors (see Dimensions)
Fears, 388, 389, 391
Forms of measurement symbols
    attitudes and interests, 390-391
    citizenship, 393, 416
    classification description symbols, 9-1
    composition, 239-240
    criteria for selection, 14
    free-response tests, 73
    guided-response tests, 89
    handwriting, 251
    intelligence, 366-368
    language arts, 220
    literature, 255
    mathematics, 319
    observation, 50
    performance activity subjects, 338
    personality and character, 394
    product analysis, 73
    rank symbols, 8-9
    reading, 231-234
    reading readiness, 223
    scale symbols, 6-8
    science, 301
    social studies, 270
    speech, 237
Free-response procedures
    in general, 69-83
    applicability, 82
    composition, 240-242
    example in eighth-grade U.S. History class, 286, 289-290
    forms of measurement symbols, 73
    handwriting, 251-252
    literature, 257-258
    mechanics, 244
    methods of eliciting free responses, 69 *^1
    methods of scoring free responses, 66-67, 73-82
    personality and character, 396-397, 398-399
    reading, 230
    rhetoric, 248
    social studies, 270-271
    speech, 352-354
Frequency, as a dimension of achievement, 30
Frequency distribution, 133-134
    types of, 138
Frequency intervals, 131-133
Frequency polygon, 136-137
    smoothed curve, 137-138
Frequency table, 131-134
    outline for construction, 133

Geometry (see Mathematics)
    example items, 322-324
Grade norms, 159
Grammar, 243-244
Graphic rating scales, 56, 417
Guess Who tests, 400-401
Guessing, corrections for, 117-118
Guidance, uses of testing in, 429-431
Guided-response procedures
    in general, 85-89
    administration of tests, 118-121
    algebra, 321-322
    arithmetic, 320
    attitudes and interests, 409-414
    composition, 242-248
    construction of guided-response tests, 90-118
    defining phenomena and dimensions to be measured, 91-94
    forms of measurement symbols, 89-90
    geometry, 322-324
    guessing, 116-118
    item sequence, 114
    literature, 256-259
    mechanics of grammar, 243-244
    personality and character, 397-398
    preparation of items, 94-112
    reading, 230-234
    reading readiness, 225, 226-227
    rhetoric, 246-248
    science information, 301
    scientific thinking, 302-306
    scoring guided-response tests, 98, 116
    social studies, 270-272
    spelling, 244-246
    standardized tests, 121-126
    test directions, 116, 119
    test length, 113-114
    timed tests, 115-116
    vocabulary, 246

Handwriting, 250-253
    evaluative standards, 252-253
    handwriting scales, 251-252
Hearing, as a factor in reading readiness, 224





Histogram, 135-136
    outline for constructing, 135
Home economics
    dimensions, 336
    example of a rating form, 350-351

Industrial arts
    dimensions, 335
    evaluation in, 357
    example of a rating form, 349
Instrument, defined, 44
Intellectual abilities and skills, classification of, 31
Intelligence, 361-385
    administering and scoring intelligence tests, 377-378
    culture bias in measuring, 374-376
    defined, 361-362
    dimensions, 26, 362-366
    dimensions commonly measured, 364-366
    dimensions inherent in tests, 363-364
    evaluations of, 378-379
    group tests, 371
    heredity and environment in, 379
    indexes of intelligence, 366-368
    individual tests, 371
    list of tests, 475-479
    measures of, related to school practice, 379-381
    mental age (MA), 366
    percentile ranks, 367-368
    profiles, 371-372, 373
    relation to reading readiness, 223-224, 227
    reliability and validity of intelligence tests, 372-376
    relation to school marks, 380
    specific uses of measures of, 381
    test items, 363-364, 369-371
    tests, 368-378
    theories of, 362-363
    verbal bias in measuring, 374-375
Intelligence quotient(s) (IQ), 367
    deviation IQ's, 367
    distribution in an unselected population, 168
    parental interpretation of, 380-381
    ratio IQ's, 367
    unreliability in, 368
    variability among tests, 376
Intensity, as a dimension of performance, 331
Interests (see Attitudes and interests)
Interpercentile range, 151-152
    comparison with standard deviation, 154
Interpreting data, 305
Interquartile range (see Quartile deviation), 152
Intervals (see Frequency intervals)
Item(s) (see Test items)
Item analysis procedures, 98-101
    difficulty, 100-101
    discrimination, 98-100

Job analysis, 341

Knowledge, classification of, 31
Knowledge and understanding
    generalized evaluative standard, 204
    literature, 256-257
    mathematics, 310-312
    science, 297-299
    social studies, 267-269, 270-271

Labeling items, 87, 111-112
Language arts, 219-265
    composition, 238-250
    dimensions, 220
    English in secondary grades, 260-262
    evaluative standards, 221
    forms of measurement symbols, 220
    handwriting, 250-253
    literature, 253-260
    marking practices, 222
    mechanics, 243-244
    procedures of measurement, 220-221
    reading, 222-236
    rhetoric, 247, 248
    speech, 236-238
    spelling, 244-246
    vocabulary, 246
Literature, 253-260
    dimensions, 253-255
    evaluative standards, 259-260
    forms of measurement symbols, 255
    literary appreciation, 2^1 259
    procedures for measuring literary knowledge, 256-257
Logical structure, 311-312, 317-318

Man-to-man rating scales, 417-418
Map and chart work, 267-270
Marking and reporting
    in general, 205-212
    art, 356
    business subjects, 356-357
    citizenship, 275, 404, 407
    composition, 250
    criteria for, 210-212
    function of school marks, 205-206
    handwriting, 253
    industrial arts, 357
    language arts, 222
    literature, 259-260
    mathematics, 313
    music, 356
    parent-teacher conferences, 209



    physical education, 357-358
    practices in, 207-210
    "psychological" marking, 275-276
    reading, 235-236
    report card examples, 494-497
    single letter marking, 207-208
    social studies, 274-276
    speech, 238
Matching items, 86, 109
Mathematics, 309-326
    compared with science, 294
    dimensions, 310-313
    evaluative standards, 313-319
    forms of measurement symbols, 319
    measuring procedures, 319-324
    nature of, 309-310
    standardized tests, 324-325, 480-482
Mean, 144-149
    comparison with median, 149-150
    outline for computing, 148
Mean deviation, 152-153
Measurability, conditions of, 20-24
    (see Dimensions)
Measurement and evaluation
    definition, 2
    factors affecting validity and efficiency, 3
Measurement symbols (see Forms of), ^-16
Measuring procedures
    in general, 34-47
    attitudes and interests, 390-391, 409-414
    basic properties of, 35-40
    citizenship, 393, 416-419
    composition, 239-248
    criteria for, 40-44
    efficiency of, 43-44
    free-response tests, 63-83
    function of,
    guided-response tests, 84-126
    handwriting, 251-252
    intelligence, 368-3 ^ ’
    language arts, 220
    literature, 256-259
    mathematics, 319-325
    observation, 48-61
    performance activity subjects in general, 338-344
    personality and character, 390-403
    product analysis, 63-83
    reading, 230-235
    reading readiness, 223-225
    reliability of, 42-43
    science, 301-308
    social studies, 269-272
    speech, 237
    validity of, 40-42
Median, 140-144
    comparison with mean, 149-150
    outline for computing, 143
Mental age (MA), 366
Mental maturity (see Intelligence)
Mode, 139
Model (in test construction), 91
Multiple choice items, 86, 108, 109
Murray pictures, sample, 71
Music
    dimensions, 334-335
    evaluation in, 356
    example of performance test, 347-348

Norm(s)
    age norms, 159
    defined, 45
    determining, 158-160
    gauging class progress by, 433
    grade norms, 159
    percentile norms, 159
    precautions in using, 160, 432-433
    standard score norms, 159-160
Normal probability curve, 162-169
    applications, 166-172
    area score relationships, 498
    defined, 165
    related to several types of score, 499

Objective tests, 84
    (see Guided-response procedures)
Objectives (see Dimensions)
Observation, 48-62
    as a basic procedure of measurement, 48-50
    anecdotal forms and procedures, 53-54
    check lists, 52, 53
    citizenship, 416-419
    devices used in, 52-59
    driver training, 346
    evaluation as a direct outcome, 50, 51
    forms of measurement symbols appropriate to, 50
    handwriting, 252
    home economics, 343
    industrial arts, 348, 349
    literature, 258
    music, 347-348
    performance activity subjects, 338-345
    personality and character, 394-395
    physical education, 343, 351-352
    principles of validity and reliability in, 59-61
    rating scales, 54-59
    reading, 235
    reading readiness, 224
    science, 306-307
    social studies, 270-271
    speech, 237
Ogive, 156-157





Open-end statements, questions, 70
Oral language, a dimension of reading readiness, 224

Paired comparison items, 412
Peer ratings, 399-403
Percentile, 151-152, 156-157
Percentile rank, 156-157
    comparison with standard score, 158
    norms, 159
Performance activity areas, 329-360
    dimensions, general, 330-332
    evaluative standards, 354-358
    examples of measuring procedures, 344-354
    forms of measurement symbols, 338
    measuring procedures, 338-354
    performance test, 339-344
    process and product, 329-330
Performance tests, 339-354
Personality and character, 386-423
    in general, 386-407
    analysis of drawings, 396
    attitudes, 390, 407-415
    character attributes, 393
    clinical definitions of maladjustment, 406-407
    criterion groups, 405
    defined, 387
    dimensions, 388-394
    evaluative practices, 407
    evaluative standards, 403-407, 414-415
    fears, beliefs, and values, 388, 389, 391
    forms of measurement symbols, 390-393, 394
    Guess Who tests, 400-401
    interests, 390-391, 407-415
    measuring procedures, 390-393, 394-403
    opinion tests, 397-398
    outline of dimensions, measurement forms, procedures, and evaluative standards, 390-393
    peer ratings, 399-403
    personality structure, 389, 392, 394
    projective techniques, 71, 398-399
    sociograms, 401-403
    standardized tests, 483-486
Personality structure, 389, 392, 394
    personality traits, 389, 392, 394
    personality types, 389, 392
    polar dimensions, 389, 392
Physical education
    dimensions, 337
    evaluation in, 357-358
    example of a plan for evaluating diving, 351
Preparing phenomena for measurement, example of, 91-94
Probability, 163-164
Problem questions, 72
Problem solving
    as a factor in intelligence, 364, 365, 369, 370
    mathematics, 312-313, 317-319
    science, 297-299, 302-306
    social studies, 269, 271
Product analysis
    in general, 69-83
    applicability, 82
    art, 345-346
    clothing, 350-351
    composition, 240
    forms of measurement symbols pertinent, 73
    handwriting, 251-252
    mechanics, 244
    methods of scoring products, 64-65, 73-82
    personality and character, 395-396
    rhetoric, 248
    science, 306-307
    social studies, 270
    spelling, 244
    typing, 348-349
    vocabulary, 246
Product scales, 75-76
Profile(s), 125-126, 171-172, 371-372, 373
Projective technique, 71, 398-399
Properties (see Dimensions)
Provide-an-answer items
    in general, 85-87
    use and construction, 110-112
Psychological grading or marking, 275-276
Publishers' addresses, 493

Qualities (see Dimensions)
Quantitative thinking, 311
Quartile deviation (Q), 152
Questionnaires (see Self-reporting procedures)

Range, 150-151
Rank order, 155
Rank symbols, 8-9
Ranking products and free responses, 74-75
Rate, as a dimension of achievement, 30
Rating products and free responses, 74
Rating scales
    in general, 54-59
    category scales, 55
    in citizenship, 416-418
    continuum scales, 55, 56
    criteria for design and use, 59
    graphic or descriptive scales, 56, 417
    illustrated, 56-57
    man-to-man, 417




    measurement symbols derived from, 58
    number of categories or intervals in, 58
    for performance tests, 345, 349, 350-351, 352, 353
    with unequal intervals, 57-58
Reading, 228-236
    diagnosing disabilities, 234-235
    dimensions, 228-230
    evaluative standards, 235-236
    measuring procedures in general, 230
    reading test scores, 233-234
    reading tests, 230-234
Reading readiness, 222-227
    dimensions of, 222-223
    evaluative standards, 227
    forms of measurement symbols, 223
    measuring procedures, 223-225
Reading readiness tests, 225-227
    correlation of scores with teacher ratings, 225
    illustrated, 226-227
Recognition tests, 339-340
Recordings
    used as standards in music, 356
    used as standards in speech, 237
Referral for individual testing, 430-431
Regression line, 177-180
Relative position, 155-158
Reliability, 42-43, 181-186
    coefficient of, 43, 185
    definition of, 42, 181-182
    fisherman's ruler analogy, 42-43
    interpretation of, 185-186
    sources of inaccurate measurement, IS'^
    standard error of measurement, 185
    of standardized tests, 124
    statistical procedures for determining, 183-185
Report cards, 205-212
    examples, 494-497
Representative measures, 139-150
Rhetoric, 246-248
Rorschach ink blot, sample, 71

Sampling
    categorical, 39
    related to guided-response procedures, 102-104
    related to measuring procedures, 38-40
    temporal, 40
Sampling error, 169-170
Scale symbols, 6-8
Scatter diagram, 174-175
School marks (see Marking and reporting)
School-wide testing programs, 424-438
    ability grouping, 431-432
    administering tests, 427
    cost, 428
    cumulative records, 429, 430
    director of a testing program, 434
    focal points of, 424-425
    frequency of testing, 432, 434-435
    guidance, 429-431
    handling results, 427-428
    instructional facilitation, 432-433
    instruments and procedures, 425-429
    locally devised tests, 428
    ordering tests, 427
    reasons for, 424
    referral for individual testing, 430-431
    selection of tests, 426
    scope of testing, 434-435
    scoring tests, 427
    tenets of an efficient testing program, 433-435
    uses of, 429-433
    vocational guidance, 430
Science, 295-308
    dimensions, 296-299
    evaluative standards, 299-301
    forms of measurement symbols, 301
    instruction in schools, 295-296
    measuring procedures, 301-308
    nature of science, 295-296
    relation with mathematics, 294
    special measurement problems, 308
    standardized tests, 307-308, 489-490
Scientific thinking, 302-305
Scoring
    defined, 45
    electrical or machine, 98, 116
    factor counting, 77-79
    factor rating, 76-77
    factorial scoring, 76-81
    free-response (essay) tests, 66, 67, 73-82
    guided-response tests, 98, 116
    keys, 116
    products, 64-65, 73-82
    ranking products and free responses, 74
    rating products and free responses, 74
    self-reports, 410-411
    in testing programs, 427
    tests of opinion, 414
    use of a product scale, 76
    weighting in, 79-81
Selection-of-an-answer items
    in general, 85-86
    use and construction, 106-110
Self-reporting procedures, 396-397, 409-411
Sensory data, 24
Short answer items, 87, 111
Simulated task or performance
    as a basic measurement concept, 41




as a test procedure in performance-activity subjects, 340
Social distance, 411, 412
Social studies, 266-292 

evaluative standards, 272-274 
forms of measurement symbols, 270
marking and reporting, 274-276
measuring procedures in general, 270-272

phenomena and dimensions in general, 
267-270 

social studies units, 276-277
Sociograms, 401-403
Sociometry, 399-403
Speech, 236-238 
dimensions, 236 
evaluative standards, 237-238 
forms and procedures of measurement, 237

marking achievement in, 238
recordings as evaluative standards, 237-238

Speed, as a dimension of performance, 30, 330

Spelling, 244-246 

Standard analysis systems, 37-38

Standard deviation, 153-154

comparison with interpercentile range, 154

outline for computing, 154
Standard differential responses, 36-37
Standard error of measurement, 169-172, 185

as related to reliability coefficient, 185
as related to score profiles, 172
Standard score, 157-158 

comparison with percentile rank, 158
norms, 159 

Standard stimulations, 35-36
Standardized, defined, 44-45
Standardized tests and testing
in general, 121-126
ability grouping, 431-432
achievement batteries, 463-467
annotated bibliography, 463-492 
art, 467-468 

attitudes and interests, 414
business, 468-469 
cost of, 428 
English, 470-471 

evaluating standardized tests, 123-125 
foreign language, 472 
guidance, 429-431
handling results, 427-428
handwriting, 251-252, 472 
health, 473 

home economics, 473-474 
instructional facilitation, 432-433



intelligence, 368-377, 475-479 
language arts, 221
literature, 256-257
mathematics, 324-325, 480-482
music, 482
ordering, 427

personality and character, 397-399, 483-486

reading, 231-234, 486-489
reading readiness, 225, 226-227
restrictions on use, 433-435
science, 307-308, 489-491
scoring, 127
selection of, 426 

social studies, 271, 272, 491-492
sources, 122-123
spelling, 245-246
uses of, 429-433
vocabulary, 246
wood working, 475 
Stencil key, 98, 117 
Stimulus pictures, 71-72
Stimulus words and objects, 70-71
Stories to complete, 72
Strip key, 117
Study habits, 286-289

observation schedule for, 288

Tabular and graphical portrayal of measurements, 130-139

Taxonomy of Educational Objectives, 31
Test(s)

in various subjects (see subject entries)

attitudes and interests, 409-414
defined, 44

free response tests, 69-83
Guess Who tests, 400-401
guided response tests, 85-176
intelligence tests, 368-377
performance tests, 339-354
personality and character, 396-399
projective tests, 71, 398-399
recognition tests, 339-340
standardized tests, 121-126, 463-492
testing programs, 424-436
tests of opinion, 411-414
work sample tests, 340-344
Test items 

chance factor of, 104, 105 
defined, 44 

difficulty, 100-101, 104, 105, 115
discrimination, 98-100
format for responses, 98
function of, in guided response tests, 94-96

language of, 96-97

need for independence, 101-102

sampling function, 102-104



sequence of, 114 

types of guided response items, 104-112 

Time, as a dimension of achievement, 29-30

Timed tests, 115-116

Timing, as a dimension of performance, 331

Trend line, 177-180

True-false items, 86, 107-108

Understanding (see Knowledge and understanding)

Understanding, levels of 

example in mathematics, 314-315

example in science, 300 

example in social studies, 279-280 

as a generalized evaluative standard, *^04

Validity 
defined, 40-41
factors bearing on, 41-42
of standardized tests, 124-125
statistical estimates of, 186-187

Values, 388-389, 391



Variability of groups, measures of, 150-155
interpercentile ranges, 151-152
mean deviation, 152-153 
quartile deviation, 152 
range, 150-151 
standard deviation, 153-154 

Variance
defined, 153
error, 178
explained, 178

as related to correlation, 177-181
total, 177 

Vision as a factor in reading readiness, 
224 

Vocabulary, 246

Vocational guidance, 430 

Weighting 

difficulty weighting, 80-81
importance weighting, 79-80
in scoring products and free responses, 79-81

Work sample test, 340-344

z score (cf. Standard score)



